* [PATCH v5 00/18] Add Cgroup support for SGX EPC memory
@ 2023-09-23  3:06 ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

SGX EPC memory allocations are separate from normal RAM allocations, and
are managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory, which is
a desirable feature in some environments, e.g., support for pod-level
control in a Kubernetes cluster on a VM or bare-metal host [1,2].
 
This patchset implements support for sgx_epc memory within the misc
cgroup controller. A user can set and enforce a max limit on total EPC
usage per cgroup. The implementation reports current usage and events of
reaching the limit per cgroup, as well as the total system capacity.
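
For illustration, here is a minimal user-space sketch (not part of the
series) that sets a per-cgroup EPC limit by writing "<resource> <bytes>"
to the cgroup's misc.max file. The "sgx_epc" resource name is the one
added later in this series; the /sys/fs/cgroup/pod1 path and the 4 MiB
limit are made-up examples:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          const char *path = "/sys/fs/cgroup/pod1/misc.max";
          const char *limit = "sgx_epc 4194304\n";  /* 4 MiB of EPC */
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* The misc controller parses "<res_name> <max>" per write. */
          if (write(fd, limit, strlen(limit)) < 0)
                  perror("write");
          close(fd);
          return 0;
  }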
 
This work was originally authored by Sean Christopherson a few years ago,
and was previously modified by Kristen C. Accardi to work with more recent
kernels and to utilize the misc cgroup controller rather than a custom
controller. Since V2 [3, 4], I have been updating the patches based on
review comments, simplifying a few aspects of the implementation/design,
and fixing some stability issues found in testing, while keeping the same
user-space-facing interfaces.
 
The patchset adds support for multiple LRU lists to track both reclaimable
EPC pages (i.e., pages the reclaimer knows about) and unreclaimable EPC
pages (i.e., pages the reclaimer isn't aware of, such as VA pages). These
pages are assigned to an LRU list as well as to an enclave, so that an
enclave's full EPC usage can be tracked and subjected to the per-cgroup
limit. During OOM events, an enclave can have its memory zapped, and all
the EPC pages tracked by the LRU lists can be freed.
 
The EPC pages allocated for KVM guests by the virtual EPC driver are not
reclaimable by the host kernel [5]. Therefore they are not tracked by any
LRU lists for reclaiming purposes in this implementation, but they are
charged toward the cgroup of the user process (e.g., QEMU) launching the
guest. When the cgroup's EPC usage reaches its limit, the virtual EPC
driver will stop allocating more EPC for the VM and return SIGBUS to the
user process, which aborts the VM launch.
 
To make it easier to follow, I reordered the patches in v4 into the
following clusters:
- Patches 1&2 are prerequisite misc cgroup changes
- Patches 3-8 deal with the 'reclaimable' pages
- Patches 9-12 deal with the 'unreclaimable' pages, which are freed only
  in OOM scenarios.
- Patches 13-15 re-organize EPC reclaiming code to be reusable by EPC
  cgroup.
- Patch 16 implements EPC cgroup as a misc cgroup.
- Patch 17 adds documentation for the EPC cgroup.
- Patch 18 adds test scripts.

I would appreciate your review and tags where appropriate.

---
v5:
- Replaced the manual test script with a selftest script.
- Restored the "From" tag for some patches to Sean. (Kai)
- Style fixes. (Jarkko)

v4:
* Collected "Tested-by" from Mikko. I kept it for now as there are no
  functional changes in v4.
* Rebased onto v6.6-rc1 and reordered patches as described above.
* Separated out the bug fixes [7,8,9]. This series depends on those
  patches. (Dave, Jarkko)
* Added comments in commit messages to preview what's to come next. (Jarkko)
* Fixed some documentation errors, gaps, and style issues. (Mikko, Randy)
* Fixed some comments, typos, and style issues in code. (Mikko, Kai)
* Improved patch format and background for reclaimable vs unreclaimable
  pages. (Kai, Jarkko)
* Fixed a typo. (Pavel)
* Excluded the previous fixes/enhancements for self-tests. Patch 18 now
  depends on series [6].
* Used the same To list for the cover letter and all patches. (Sohil)
 
v3:
 
* Added EPC states to replace flags in the sgx_epc_page struct. (Jarkko)
* Unrolled wrappers for cond_resched and list operations. (Dave)
* Separate patches for adding reclaimable and unreclaimable lists. (Dave)
* Other improvements to patch flow, commit messages, and style. (Dave, Jarkko)
* Simplified the cgroup tree walking with plain
  css_for_each_descendant_pre.
* Fixed race conditions and crashes.
* Made the OOM killer wait for the victim enclave's pages to be reclaimed.
* Unblocked the user by handling the misc_max_write callback asynchronously.
* Rebased onto 6.4; this series is no longer based on the MCA patchset.
* Fixed an overflow in misc_try_charge().
* Fixed a NULL pointer dereference in the SGX PF handler.
* Updated and included the SGX selftest patches previously reviewed. Those
  patches fix issues triggered under the high EPC pressure required for
  cgroup testing.
* Added test scripts to help set up and test SGX EPC cgroups.
 
[1] https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2] https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3] https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
[4] https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
[5] Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[6] https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko@kernel.org/
[7] https://lore.kernel.org/linux-sgx/ZLcXmvDKheCRYOjG@slm.duckdns.org/
[8] https://lore.kernel.org/linux-sgx/20230721120231.13916-1-haitao.huang@linux.intel.com/
[9] https://lore.kernel.org/linux-sgx/20230728051024.33063-1-haitao.huang@linux.intel.com/

Haitao Huang (2):
  x86/sgx: Introduce EPC page states
  selftests/sgx: Add scripts for EPC cgroup testing

Kristen Carlson Accardi (3):
  cgroup/misc: Add per resource callbacks for CSS events
  cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  x86/sgx: Limit process EPC usage with misc cgroup controller

Sean Christopherson (13):
  x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
  x86/sgx: Use sgx_epc_lru_lists for existing active page list
  x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
  x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  x86/sgx: Use a list to track to-be-reclaimed pages
  x86/sgx: Store struct sgx_encl when allocating new VA pages
  x86/sgx: Add EPC page flags to identify owner types
  x86/sgx: store unreclaimable pages in LRU lists
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  x86/sgx: Prepare for multiple LRUs
  Docs/x86/sgx: Add description for cgroup support

 Documentation/arch/x86/sgx.rst                |  82 ++++
 arch/x86/Kconfig                              |  13 +
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/driver.c              |  27 +-
 arch/x86/kernel/cpu/sgx/encl.c                |  72 ++-
 arch/x86/kernel/cpu/sgx/encl.h                |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 415 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  59 +++
 arch/x86/kernel/cpu/sgx/ioctl.c               |  25 +-
 arch/x86/kernel/cpu/sgx/main.c                | 399 +++++++++++++----
 arch/x86/kernel/cpu/sgx/sgx.h                 | 117 ++++-
 include/linux/misc_cgroup.h                   |  34 ++
 kernel/cgroup/misc.c                          |  57 ++-
 .../selftests/sgx/run_epc_cg_selftests.sh     | 147 +++++++
 .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
 15 files changed, 1314 insertions(+), 151 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

-- 
2.25.1


* [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Kristen Carlson Accardi <kristen@linux.intel.com>

The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created, the 'css_free' event
when a cgroup is destroyed, or the event of a user writing the max value
to the misc.max file to set the usage limit of a specific resource
[admin-guide/cgroup-v2.rst, 5-9. Misc].

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. These will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver (a rough illustration follows the list):
- On css_alloc, allocate and initialize the necessary structures for EPC
  reclaiming, e.g., LRU list, work queue, etc.
- On css_free, clean up and free those structures created in alloc.
- On max_write, trigger EPC reclaiming if the new limit is at or below
  current usage.
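
The following sketch shows how a hypothetical provider of a made-up
MISC_CG_RES_FOO resource might wire up these callbacks; it is an
illustration only, not the SGX code added later in the series:

  static void foo_cg_free(struct misc_cg *cg)
  {
          /* Free any per-cgroup state allocated in foo_cg_alloc(). */
  }

  static void foo_cg_max_write(struct misc_cg *cg)
  {
          /* React to the new limit, e.g., kick off reclaim if needed. */
  }

  static int foo_cg_alloc(struct misc_cg *cg)
  {
          /*
           * Allocate/initialize per-cgroup state here. Note that
           * misc_cg_alloc() finds these callbacks through the parent
           * cgroup, so also install them in the new @cg for its
           * future children.
           */
          cg->res[MISC_CG_RES_FOO].alloc = foo_cg_alloc;
          cg->res[MISC_CG_RES_FOO].free = foo_cg_free;
          cg->res[MISC_CG_RES_FOO].max_write = foo_cg_max_write;
          return 0;
  }

  /* Called once by the provider, e.g., on the root misc cgroup. */
  static void foo_register_callbacks(struct misc_cg *root)
  {
          root->res[MISC_CG_RES_FOO].alloc = foo_cg_alloc;
          root->res[MISC_CG_RES_FOO].free = foo_cg_free;
          root->res[MISC_CG_RES_FOO].max_write = foo_cg_max_write;
  }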

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V5:
- Remove prefixes from the callback names (tj)
- Update commit message (Jarkko)

V4:
- Moved this to the front of the series.
- Applies on cgroup/for-6.6 with the overflow fix for misc.

V3:
- Removed the released() callback
---
 include/linux/misc_cgroup.h |  5 +++++
 kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..96a88822815a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -37,6 +37,11 @@ struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+
+	/* per resource callback ops */
+	int (*alloc)(struct misc_cg *cg);
+	void (*free)(struct misc_cg *cg);
+	void (*max_write)(struct misc_cg *cg);
 };
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..62c9198dee21 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
 
 	cg = css_misc(of_css(of));
 
-	if (READ_ONCE(misc_res_capacity[type]))
+	if (READ_ONCE(misc_res_capacity[type])) {
 		WRITE_ONCE(cg->res[type].max, max);
-	else
+		if (cg->res[type].max_write)
+			cg->res[type].max_write(cg);
+	} else {
 		ret = -EINVAL;
+	}
 
 	return ret ? ret : nbytes;
 }
@@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+	struct misc_cg *parent_cg;
 	enum misc_res_type i;
 	struct misc_cg *cg;
+	int ret;
 
 	if (!parent_css) {
 		cg = &root_cg;
+		parent_cg = &root_cg;
 	} else {
 		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
 		if (!cg)
 			return ERR_PTR(-ENOMEM);
+		parent_cg = css_misc(parent_css);
 	}
 
 	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
 		WRITE_ONCE(cg->res[i].max, MAX_NUM);
 		atomic64_set(&cg->res[i].usage, 0);
+		if (parent_cg->res[i].alloc) {
+			ret = parent_cg->res[i].alloc(cg);
+			if (ret)
+				goto alloc_err;
+		}
 	}
 
 	return &cg->css;
+
+alloc_err:
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (parent_cg->res[i].free)
+			cg->res[i].free(cg);
+	kfree(cg);
+	return ERR_PTR(ret);
 }
 
 /**
@@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_misc(css));
+	struct misc_cg *cg = css_misc(css);
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].free)
+			cg->res[i].free(cg);
+
+	kfree(cg);
 }
 
 /* Cgroup controller callbacks */
-- 
2.25.1


* [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Add per-resource-type private data so that SGX can store additional
per-cgroup data in misc_cg->res[MISC_CG_RES_SGX_EPC].

Export misc_cg_root() so the SGX driver can initialize and add those
additional structures to the root misc cgroup as part of initialization
for EPC cgroup support. This bootstraps the same additional
initialization for non-root cgroups in the 'alloc()' callback added in the
previous patch.

The SGX driver, as the EPC memory provider, will have a background
worker to reclaim EPC pages to make room for new allocations in the same
cgroup when its usage counter nears the limit controlled by the cgroup
and its ancestors. The worker therefore needs to walk from the current
cgroup up to the root. To enable this walk, move parent_misc() into
misc_cgroup.h and make it inline so the function is available to SGX,
rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to use
the new name.
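
As a rough sketch of the walk this enables (not code from the series;
the limit check is a placeholder), a charging-side helper could
traverse the ancestors like this:

  static bool sgx_epc_usage_at_limit(struct misc_cg *cg)
  {
          struct misc_cg *i;

          /* Walk from @cg up to the root misc cgroup. */
          for (i = cg; i; i = misc_cg_parent(i)) {
                  struct misc_res *res = &i->res[MISC_CG_RES_SGX_EPC];

                  if (atomic64_read(&res->usage) >= READ_ONCE(res->max))
                          return true;
          }

          return false;
  }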

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V5:
- Revised commit message (Jarkko)

V4:
- Moved this to the second in the series.
---
 include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
 kernel/cgroup/misc.c        | 25 ++++++++++++-------------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 96a88822815a..87f29f8597e1 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
 	MISC_CG_RES_SEV,
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* SGX EPC memory resource */
+	MISC_CG_RES_SGX_EPC,
 #endif
 	MISC_CG_RES_TYPES
 };
@@ -37,6 +41,7 @@ struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+	void *priv;
 
 	/* per resource callback ops */
 	int (*alloc)(struct misc_cg *cg);
@@ -59,6 +64,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
@@ -78,6 +84,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -102,6 +122,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 62c9198dee21..4633b8629e63 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* Intel SGX EPC memory bytes */
+	"sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
@@ -40,18 +44,13 @@ static struct misc_cg root_cg;
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -150,7 +149,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic64_add_return(amount, &res->usage);
@@ -163,12 +162,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic64_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -190,7 +189,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1


* [PATCH v5 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
@ 2023-09-23  3:06 ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup will later have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage approaches
its limit.

Currently, ksgxd does not track VA and SECS pages. They are considered
'unreclaimable' pages that are only deallocated when their respective
owning enclaves are destroyed and all associated resources are released.

When an EPC cgroup cannot reclaim any more reclaimable EPC pages to
reduce its usage below its limit, the cgroup must also reclaim those
unreclaimable pages by killing their owning enclaves. The VA and SECS
pages are later also tracked in an 'unreclaimable' list added to this
structure to support this OOM killing of enclaves.
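
As a sketch of where this is headed (the per-cgroup embedding arrives
later in the series; sgx_epc_cgroup and the helper below are
hypothetical here):

  /* One LRU instance embedded in each cgroup's EPC state. */
  struct sgx_epc_cgroup {
          struct sgx_epc_lru_lists lru;
          /* ... usage counter, work queue, etc. ... */
  };

  static void sgx_epc_cgroup_init(struct sgx_epc_cgroup *epc_cg)
  {
          sgx_lru_init(&epc_cg->lru);
  }

  /* All list manipulation happens under the embedded lock: */
  static void sgx_lru_push_reclaimable(struct sgx_epc_lru_lists *lrus,
                                       struct sgx_epc_page *page)
  {
          spin_lock(&lrus->lock);
          list_add_tail(&page->list, &lrus->reclaimable);
          spin_unlock(&lrus->lock);
  }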

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/sgx.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..018414b2abe8 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -83,6 +83,20 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
+ */
+struct sgx_epc_lru_lists {
+	spinlock_t lock;
+	struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
+{
+	spin_lock_init(&lrus->lock);
+	INIT_LIST_HEAD(&lrus->reclaimable);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.25.1


* [PATCH v5 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

All EPC pages of enclaves including Version Array (VA) and SGX Enclave
Control Structure (SECS) will be tracked in sgx_epc_lru_lists structs,
one per cgroup. For now just replace the existing sgx_active_page_list
in the reclaimer and its spinlock with a global sgx_epc_lru_lists
struct. VA and SECS pages are still not tracked at this point but they
will be tracked after an unreclaimable LRU list is added to the
sgx_epc_lru_lists struct.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Spelled out SECS, VA (Jarkko)

V4:
- No change, only reordered the patch.

V3:
- Remove usage of list wrapper
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..afce51d6e94a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -26,10 +26,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_lists sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -304,13 +303,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
@@ -322,7 +321,7 @@ static void sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -345,9 +344,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -378,7 +377,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -430,6 +429,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -505,10 +506,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -523,18 +524,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
 }
@@ -567,7 +568,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
+		if (list_empty(&sgx_global_lru.reclaimable))
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1


* [PATCH v5 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Replace sgx_mark_page_reclaimable() and sgx_unmark_page_reclaimable()
with sgx_record_epc_page() and sgx_drop_epc_page(). The
sgx_record_epc_page() function adds the epc_page to the "reclaimable"
list in the sgx_epc_lru_lists struct, while sgx_drop_epc_page() removes
the page from the LRU list.

For now, this change serves as a straightforward replacement of the two
functions for pages tracked by the reclaimer. When the unreclaimable
list is added to track VA and SECS pages for cgroups, these functions
will be updated to add/remove them from the unreclaimable lists.
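
For orientation, the calling pattern after this change looks roughly as
follows (a sketch distilled from the hunks below, error handling
elided):

  /* Allocation path: allocate, then record on the reclaimable LRU. */
  epc_page = sgx_alloc_epc_page(encl_page, false);
  if (IS_ERR(epc_page))
          return PTR_ERR(epc_page);
  sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);

  /* Teardown path: detach from the LRU before freeing. */
  if (sgx_drop_epc_page(epc_page))
          return -EBUSY;  /* page is currently being reclaimed */
  sgx_free_epc_page(epc_page);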

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- style fixes (Jarkko)

V4:
- Code update needed for patch reordering
- Revised commit message.
---
 arch/x86/kernel/cpu/sgx/encl.c  |  6 +++---
 arch/x86/kernel/cpu/sgx/ioctl.c |  8 ++++----
 arch/x86/kernel/cpu/sgx/main.c  | 22 ++++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++--
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..97a53e34a8b4 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -272,7 +272,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_mark_page_reclaimable(entry->epc_page);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	return entry;
 }
@@ -398,7 +398,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -714,7 +714,7 @@ void sgx_encl_release(struct kref *ref)
 			 * The page and its radix tree entry cannot be freed
 			 * if the page is being held by the reclaimer.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page))
+			if (sgx_drop_epc_page(entry->epc_page))
 				continue;
 
 			sgx_encl_free_epc_page(entry->epc_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..a75eb44022a3 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -322,7 +322,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -961,7 +961,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			 * Prevent page from being reclaimed while mutex
 			 * is released.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+			if (sgx_drop_epc_page(entry->epc_page)) {
 				ret = -EAGAIN;
 				goto out_entry_changed;
 			}
@@ -976,7 +976,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_mark_page_reclaimable(entry->epc_page);
+			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 		}
 
 		/* Change EPC type */
@@ -1133,7 +1133,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 			goto out_unlock;
 		}
 
-		if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+		if (sgx_drop_epc_page(entry->epc_page)) {
 			ret = -EBUSY;
 			goto out_unlock;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index afce51d6e94a..dec1d57cbff6 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,7 +268,6 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
-
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -498,31 +497,34 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_record_epc_page() - Add a page to the appropriate LRU list
  * @page:	EPC page
+ * @flags:	The type of page that is being recorded
  *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
+ * Mark a page with the specified flags and add it to the appropriate
+ * list.
  */
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	page->flags |= flags;
+	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_drop_epc_page() - Remove a page from a LRU list
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
  *   -EBUSY if the page is in the process of being reclaimed
  */
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 018414b2abe8..113d930fd087 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -101,8 +101,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
+int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 06/18] x86/sgx: Introduce EPC page states
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

Use the lower 3 bits of the flags field in struct sgx_epc_page to
track the state of an EPC page over its life cycle, and define an enum
for the possible states. More states will be added later.
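
As a quick standalone illustration (a userspace sketch, not kernel
code; the mask value mirrors the SGX_EPC_PAGE_STATE_MASK added below,
everything else here is hypothetical), updating the state means
clearing the low 3 bits before OR-ing in the new value, so that the
non-state flag bits survive:

    /* Userspace sketch of a 3-bit state field in the low bits of flags. */
    #include <stdio.h>

    #define STATE_MASK 0x7UL        /* mirrors GENMASK(2, 0) */

    static unsigned long set_state(unsigned long flags, unsigned long state)
    {
            flags &= ~STATE_MASK;                   /* drop the old state */
            return flags | (state & STATE_MASK);    /* install the new one */
    }

    int main(void)
    {
            unsigned long flags = 0x18;     /* some non-state bits set */

            flags = set_state(flags, 2);    /* e.g. state RECLAIMABLE == 2 */
            printf("state=%lu other=0x%lx\n",
                   flags & STATE_MASK, flags & ~STATE_MASK);
            return 0;                       /* prints: state=2 other=0x18 */
    }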

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V4:
- No changes other than required for patch reordering.

V3:
- This is new in V3 to replace the bit mask based approach (requested by Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++---
 arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++--
 arch/x86/kernel/cpu/sgx/main.c  | 19 +++++++------
 arch/x86/kernel/cpu/sgx/sgx.h   | 49 ++++++++++++++++++++++++++++++---
 4 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 97a53e34a8b4..f5afc8d65e22 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -244,8 +244,12 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
 {
 	struct sgx_epc_page *epc_page = encl->secs.epc_page;
 
-	if (!epc_page)
+	if (!epc_page) {
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (!IS_ERR(epc_page))
+			sgx_record_epc_page(epc_page,
+					    SGX_EPC_PAGE_UNRECLAIMABLE);
+	}
 
 	return epc_page;
 }
@@ -272,7 +276,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 
 	return entry;
 }
@@ -398,7 +402,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1256,6 +1260,8 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
+	sgx_record_epc_page(epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	return epc_page;
 }
@@ -1315,7 +1321,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
 {
 	int ret;
 
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
 
 	ret = __eremove(sgx_get_epc_virt_addr(page));
 	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index a75eb44022a3..9a32bf5a1070 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
+	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
+
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
 
@@ -322,7 +325,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -976,7 +979,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 		}
 
 		/* Change EPC type */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index dec1d57cbff6..b26860399402 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -318,7 +318,7 @@ static void sgx_reclaim_pages(void)
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
-			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			sgx_epc_page_reset_state(epc_page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -344,6 +344,7 @@ static void sgx_reclaim_pages(void)
 
 skip:
 		spin_lock(&sgx_global_lru.lock);
+		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
@@ -367,7 +368,7 @@ static void sgx_reclaim_pages(void)
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(epc_page);
 
 		sgx_free_epc_page(epc_page);
 	}
@@ -507,9 +508,9 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
-	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+	if (sgx_epc_page_reclaimable(flags))
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
@@ -527,7 +528,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+	if (sgx_epc_page_reclaimable(page->flags)) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
 			spin_unlock(&sgx_global_lru.lock);
@@ -535,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 		}
 
 		list_del(&page->list);
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -607,6 +608,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	struct sgx_numa_node *node = section->node;
 
+	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
+
 	spin_lock(&node->lock);
 
 	page->owner = NULL;
@@ -614,7 +617,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
 		list_add_tail(&page->list, &node->free_page_list);
-	page->flags = SGX_EPC_PAGE_IS_FREE;
+	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
 	atomic_long_inc(&sgx_nr_free_pages);
@@ -715,7 +718,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	 * If the page is on a free list, move it to the per-node
 	 * poison page list.
 	 */
-	if (page->flags & SGX_EPC_PAGE_IS_FREE) {
+	if (page->flags == SGX_EPC_PAGE_FREE) {
 		list_move(&page->list, &node->sgx_poison_page_list);
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 113d930fd087..2faeb40b345f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -23,11 +23,36 @@
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
 
-/* Pages, which are being tracked by the page reclaimer. */
-#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
+enum sgx_epc_page_state {
+	/* Not tracked by the reclaimer:
+	 * Pages allocated for virtual EPC which are never tracked by the host
+	 * reclaimer; pages just allocated from free list but not yet put in
+	 * use; pages just reclaimed, but not yet returned to the free list.
+	 * Becomes FREE after sgx_free_epc()
+	 * Becomes RECLAIMABLE or UNRECLAIMABLE after sgx_record_epc()
+	 */
+	SGX_EPC_PAGE_NOT_TRACKED = 0,
+
+	/* Page is in the free list, ready for allocation
+	 * Becomes NOT_TRACKED after sgx_alloc_epc_page()
+	 */
+	SGX_EPC_PAGE_FREE = 1,
+
+	/* Page is in use and tracked in a reclaimable LRU list
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_RECLAIMABLE = 2,
+
+	/* Page is in use but tracked in an unreclaimable LRU list. These are
+	 * only reclaimable when the whole enclave is OOM killed or the enclave
+	 * is released, e.g., VA, SECS pages
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
-/* Pages on free list */
-#define SGX_EPC_PAGE_IS_FREE		BIT(1)
+};
+
+#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
 
 struct sgx_epc_page {
 	unsigned int section;
@@ -37,6 +62,22 @@ struct sgx_epc_page {
 	struct list_head list;
 };
 
+static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+}
+
+static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
+static inline bool sgx_epc_page_reclaimable(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a RECLAIM_IN_PROGRESS state so that the reclaimer no longer relies
on list_empty(&epc_page->list) to determine whether an EPC page has
been selected as a reclaim candidate.

When a page is being reclaimed from the page pool (sgx_global_lru),
there is an intermediate stage where a page may have been identified as
a candidate for reclaiming, but has not yet been reclaimed. Currently
such pages are list_del_init()'d from the global LRU list and stored in
an array on the stack. To prevent another thread from dropping the same
page in the middle of reclaiming, sgx_drop_epc_page() checks for
list_empty(&epc_page->list).

A later patch will replace the array on stack with a temporary list to
store the candidate pages, so list_empty() should no longer be used for
this purpose.
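
To make the ambiguity concrete, here is a standalone sketch with
simplified, hypothetical types (not the kernel implementation): with
an explicit state, a concurrent dropper detects "reclaim in progress"
directly instead of inferring it from list membership:

    /* Sketch: drop() keyed off an explicit state instead of list_empty(). */
    #include <errno.h>
    #include <stdio.h>

    enum state { NOT_TRACKED, RECLAIMABLE, RECLAIM_IN_PROGRESS };

    struct page { enum state state; };

    static int drop(struct page *p)
    {
            if (p->state == RECLAIM_IN_PROGRESS)
                    return -EBUSY;          /* reclaimer owns the page */
            p->state = NOT_TRACKED;         /* safe to untrack it */
            return 0;
    }

    int main(void)
    {
            struct page p = { .state = RECLAIM_IN_PROGRESS };

            printf("drop() = %d\n", drop(&p));      /* prints: drop() = -16 */
            return 0;
    }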

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Fixed some typos.
- Revised commit message.

V3:
- Extend the sgx_epc_page_state enum introduced earlier to replace the
flag based approach.
---
 arch/x86/kernel/cpu/sgx/main.c | 21 ++++++++++-----------
 arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++++++++++++++++
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b26860399402..c1ae19a154d0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -312,13 +312,15 @@ static void sgx_reclaim_pages(void)
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
 			chunk[cnt++] = epc_page;
-		else
+		} else {
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
+		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -528,16 +530,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (sgx_epc_page_reclaimable(page->flags)) {
-		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		list_del(&page->list);
-		sgx_epc_page_reset_state(page);
+	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
 	}
+
+	list_del(&page->list);
+	sgx_epc_page_reset_state(page);
 	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 2faeb40b345f..764cec23f4e5 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -40,6 +40,8 @@ enum sgx_epc_page_state {
 
 	/* Page is in use and tracked in a reclaimable LRU list
 	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
+	 * for reclaiming
 	 */
 	SGX_EPC_PAGE_RECLAIMABLE = 2,
 
@@ -50,6 +52,14 @@ enum sgx_epc_page_state {
 	 */
 	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
+	/* Page is being prepared for reclamation, tracked in a temporary
+	 * isolated list by the reclaimer.
+	 * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
+	 * fails for any reason.
+	 * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
+	 * and immediately sgx_free_epc() is called to make it FREE.
+	 */
+	SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 4,
 };
 
 #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
@@ -73,6 +83,12 @@ static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned lo
 	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
 }
 
+static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags &
+						    SGX_EPC_PAGE_STATE_MASK);
+}
+
 static inline bool sgx_epc_page_reclaimable(unsigned long flags)
 {
 	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 08/18] x86/sgx: Use a list to track to-be-reclaimed pages
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Change sgx_reclaim_pages() to use a list rather than an array for
storing the epc_pages which will be reclaimed. This change is needed
to transition to the LRU implementation for EPC cgroup support.

When the EPC cgroup is implemented, the reclaiming process will do a
pre-order walk of the subtree rooted at the limit-violating cgroup. As
each node is visited, candidate pages are selected from its
"reclaimable" LRU list and moved onto this temporary list. Passing a
list from node to node for temporary storage during this walk is more
straightforward than using an array.
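
The pattern can be pictured with a standalone sketch (hypothetical
types, not the kernel's list API): a single "iso" list is threaded
through the walk, and each visited node moves its candidates onto it,
something a fixed-size array cannot do as cleanly:

    /* Sketch: one candidate list shared across a walk of cgroup nodes. */
    #include <stdio.h>

    struct page { int id; struct page *next; };
    struct node { struct page *reclaimable; struct node *child; };

    /* Move up to 'budget' pages from this node onto the shared iso list. */
    static int scan_node(struct node *n, struct page **iso, int budget)
    {
            int moved = 0;

            while (n->reclaimable && moved < budget) {
                    struct page *p = n->reclaimable;

                    n->reclaimable = p->next;
                    p->next = *iso;         /* push onto the shared list */
                    *iso = p;
                    moved++;
            }
            return moved;
    }

    int main(void)
    {
            struct page a = { 1, NULL }, b = { 2, &a };
            struct node leaf = { NULL, NULL }, root = { &b, &leaf };
            struct page *iso = NULL;
            int budget = 4;

            for (struct node *n = &root; n && budget; n = n->child)
                    budget -= scan_node(n, &iso, budget);

            for (struct page *p = iso; p; p = p->next)
                    printf("isolated page %d\n", p->id);
            return 0;
    }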

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Changes needed for patch reordering
- Revised commit message

V3:
- Removed list wrappers
---
 arch/x86/kernel/cpu/sgx/main.c | 40 +++++++++++++++-------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c1ae19a154d0..fba06dc5abfe 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -293,12 +293,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  */
 static void sgx_reclaim_pages(void)
 {
-	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	struct sgx_epc_page *epc_page;
 	pgoff_t page_index;
-	int cnt = 0;
+	LIST_HEAD(iso);
 	int ret;
 	int i;
 
@@ -314,18 +313,22 @@ static void sgx_reclaim_pages(void)
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			chunk[cnt++] = epc_page;
+			list_move_tail(&epc_page->list, &iso);
 		} else {
-			/* The owner is freeing the page. No need to add the
-			 * page back to the list of reclaimable pages.
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
 			 */
 			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
 		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
+	if (list_empty(&iso))
+		return;
+
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->owner;
 
 		if (!sgx_reclaimer_age(epc_page))
@@ -340,6 +343,7 @@ static void sgx_reclaim_pages(void)
 			goto skip;
 		}
 
+		i++;
 		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
 		mutex_unlock(&encl_page->encl->lock);
 		continue;
@@ -347,27 +351,19 @@ static void sgx_reclaim_pages(void)
 skip:
 		spin_lock(&sgx_global_lru.lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
-		chunk[i] = NULL;
-	}
-
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (epc_page)
-			sgx_reclaimer_block(epc_page);
 	}
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (!epc_page)
-			continue;
+	list_for_each_entry(epc_page, &iso, list)
+		sgx_reclaimer_block(epc_page);
 
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->owner;
-		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		sgx_epc_page_reset_state(epc_page);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

In a later patch, when a cgroup has exceeded the max capacity for EPC
pages, it may need to identify and OOM kill a less active enclave to
make room for other enclaves within the same group. Such a victim
enclave would have no active pages other than the unreclaimable Version
Array (VA) and SECS pages. Therefore, the cgroup needs to examine its
unreclaimable page list and find the enclave that owns a given SECS or
VA page. This requires a backpointer from a page to its enclave, which
is not available for VA pages.

Because struct sgx_epc_page instances of VA pages are not owned by an
sgx_encl_page instance, mark their owner as the struct sgx_encl itself:
pass the struct sgx_encl of the enclave allocating the VA page to
sgx_alloc_epc_page(), which stores this value in the owner field of the
struct sgx_epc_page. In a later patch, VA pages will be placed on an
unreclaimable queue that the cgroup can examine to select the enclave
to be OOM killed.
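
The resulting owner field can be pictured with a standalone sketch
(simplified, hypothetical types; the explicit type tag here is purely
illustrative, since the kernel discriminates by the page's recorded
state instead):

    /* Sketch of a type-discriminated owner pointer for an EPC page. */
    #include <stdio.h>

    struct encl { int id; };
    struct encl_page { struct encl *encl; };

    enum owner_type { OWNER_ENCL_PAGE, OWNER_ENCL };

    struct epc_page {
            enum owner_type type;
            union {
                    struct encl_page *encl_page;    /* regular enclave pages */
                    struct encl *encl;              /* VA pages */
            };
    };

    static struct encl *owning_enclave(struct epc_page *p)
    {
            return p->type == OWNER_ENCL ? p->encl : p->encl_page->encl;
    }

    int main(void)
    {
            struct encl e = { .id = 7 };
            struct epc_page va = { .type = OWNER_ENCL, .encl = &e };

            printf("VA page belongs to enclave %d\n", owning_enclave(&va)->id);
            return 0;
    }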

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Fixed some comments in code (Jarkko)

V4:
- Changes needed for patch reordering
- Revised commit messages (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  |  5 +++--
 arch/x86/kernel/cpu/sgx/encl.h  |  2 +-
 arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
 arch/x86/kernel/cpu/sgx/main.c  | 20 ++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  7 ++++++-
 5 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f5afc8d65e22..ec3402d41b63 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1236,6 +1236,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @encl:    The new owner of the page.
  * @reclaim: Reclaim EPC pages directly if none available. Enclave
  *           mutex should not be held if this is set.
  *
@@ -1245,12 +1246,12 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(NULL, reclaim);
+	epc_page = sgx_alloc_epc_page(encl, reclaim);
 	if (IS_ERR(epc_page))
 		return ERR_CAST(epc_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..831d63f80f5a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,7 +116,7 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 9a32bf5a1070..164256ea18d0 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -30,7 +30,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
 		if (!va_page)
 			return ERR_PTR(-ENOMEM);
 
-		va_page->epc_page = sgx_alloc_va_page(reclaim);
+		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
 		if (IS_ERR(va_page->epc_page)) {
 			err = ERR_CAST(va_page->epc_page);
 			kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fba06dc5abfe..ed813288af44 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -107,7 +107,7 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	struct sgx_encl *encl = page->encl;
 	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
@@ -139,7 +139,7 @@ static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 
 static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	unsigned long addr = page->desc & PAGE_MASK;
 	struct sgx_encl *encl = page->encl;
 	int ret;
@@ -196,7 +196,7 @@ void sgx_ipi_cb(void *info)
 static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 			 struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_va_page *va_page;
 	unsigned int va_offset;
@@ -249,7 +249,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 				struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_backing secs_backing;
 	int ret;
@@ -309,7 +309,7 @@ static void sgx_reclaim_pages(void)
 			break;
 
 		list_del_init(&epc_page->list);
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
@@ -329,7 +329,7 @@ static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (!sgx_reclaimer_age(epc_page))
 			goto skip;
@@ -362,7 +362,7 @@ static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
 		if (!IS_ERR(page)) {
-			page->owner = owner;
+			page->encl_page = owner;
 			break;
 		}
 
@@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	page->owner = NULL;
+	page->encl_page = NULL;
 	if (page->poison)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
@@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
-		section->pages[i].owner = NULL;
+		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 764cec23f4e5..5110dd433b80 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -68,7 +68,12 @@ struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *owner;
+
+	/* Possible owner types */
+	union {
+		struct sgx_encl_page *encl_page;
+		struct sgx_encl *encl;
+	};
 	struct list_head list;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 10/18] x86/sgx: Add EPC page flags to identify owner types
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Two types of owners of a struct sgx_epc_page, 'sgx_encl' for VA pages
and 'sgx_encl_page' for all other pages, can be stored in the
previously introduced union field.

OOM support for cgroups requires that the owner be identified when
selecting pages from the unreclaimable list. Address this by adding
flags for the owner type.
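
As an illustration, the flags disambiguate the union as in the sketch
below (the helper name is hypothetical, but the logic mirrors how a
later patch in this series resolves the owner of an OOM victim page):

	static struct sgx_encl *sgx_epc_page_to_encl(struct sgx_epc_page *epc_page)
	{
		if (epc_page->flags & SGX_EPC_OWNER_PAGE)
			return epc_page->encl_page->encl;

		if (epc_page->flags & SGX_EPC_OWNER_ENCL)
			return epc_page->encl;

		/* Neither flag set: the owner is not tracked. */
		return NULL;
	}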

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Updates for patch reordering.
- Rename SGX_EPC_OWNER_ENCL_PAGE to SGX_EPC_OWNER_PAGE. (Jarkko)
- Commit message changes. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  | 9 +++++----
 arch/x86/kernel/cpu/sgx/ioctl.c | 6 ++++--
 arch/x86/kernel/cpu/sgx/sgx.h   | 6 ++++++
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index ec3402d41b63..da1657813fce 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -248,6 +248,7 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (!IS_ERR(epc_page))
 			sgx_record_epc_page(epc_page,
+					    SGX_EPC_OWNER_PAGE |
 					    SGX_EPC_PAGE_UNRECLAIMABLE);
 	}
 
@@ -276,7 +277,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE | SGX_EPC_PAGE_RECLAIMABLE);
 
 	return entry;
 }
@@ -402,7 +403,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE | SGX_EPC_PAGE_RECLAIMABLE);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1261,8 +1262,8 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
-	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_UNRECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL |
+				      SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 164256ea18d0..cd338e93acc1 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -114,6 +114,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
 	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_OWNER_PAGE |
 			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	/* Set only after completion, as encl->lock has not been taken. */
@@ -325,7 +326,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE | SGX_EPC_PAGE_RECLAIMABLE);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -979,7 +980,8 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMABLE);
+			sgx_record_epc_page(entry->epc_page,
+					    SGX_EPC_OWNER_PAGE | SGX_EPC_PAGE_RECLAIMABLE);
 		}
 
 		/* Change EPC type */
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 5110dd433b80..51aba1cd1937 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -64,6 +64,12 @@ enum sgx_epc_page_state {
 
 #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
 
+/* flag for pages owned by a struct sgx_encl_page */
+#define SGX_EPC_OWNER_PAGE		BIT(3)
+
+/* flag for pages owned by a struct sgx_encl */
+#define SGX_EPC_OWNER_ENCL		BIT(4)
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 11/18] x86/sgx: Store unreclaimable pages in LRU lists
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

When an OOM event occurs, all pages associated with an enclave will need
to be freed, including pages that are not currently tracked by the
cgroup LRU lists.

Add a new "unreclaimable" list to the sgx_epc_lru_lists struct and
update sgx_record_epc_page() and sgx_drop_epc_page() to add/remove
VA and SECS pages to/from this "unreclaimable" list.
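
sgx_drop_epc_page() itself is not shown in this diff. As a rough sketch
(an assumption inferred from its callers, not necessarily the actual
body), it unlinks the page from whichever LRU list it is on:

	int sgx_drop_epc_page(struct sgx_epc_page *page)
	{
		spin_lock(&sgx_global_lru.lock);
		list_del_init(&page->list);
		spin_unlock(&sgx_global_lru.lock);

		return 0;
	}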

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Updates for patch reordering.
- Revised commit messages.
- Revised comments for the list.

V3:
- Removed tracking of virtual EPC pages in the unreclaimable list as the
host kernel does not reclaim them. The EPC cgroup implemented later only
blocks allocation for a guest if the limit is reached, by returning
-ENOMEM from sgx_alloc_epc_page() called by virt_epc, and does nothing
else. Therefore, there is no need to track those in LRU lists.
---
 arch/x86/kernel/cpu/sgx/encl.c  | 2 ++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 arch/x86/kernel/cpu/sgx/main.c  | 3 +++
 arch/x86/kernel/cpu/sgx/sgx.h   | 8 +++++++-
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index da1657813fce..a8617e6a4b4e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -746,6 +746,7 @@ void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
@@ -754,6 +755,7 @@ void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index cd338e93acc1..50ddd8988452 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -48,6 +48,7 @@ void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ed813288af44..f3a3ed894616 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,6 +268,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -510,6 +511,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 	page->flags |= flags;
 	if (sgx_epc_page_reclaimable(flags))
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	else
+		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 51aba1cd1937..337747bef7c2 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -152,17 +152,23 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 }
 
 /*
- * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
+ * Contains EPC pages tracked by the reclaimer (ksgxd).
  */
 struct sgx_epc_lru_lists {
 	spinlock_t lock;
 	struct list_head reclaimable;
+	/*
+	 * Tracks SECS, VA pages, etc., which are only freeable after all
+	 * their dependent reclaimable pages are freed.
+	 */
+	struct list_head unreclaimable;
 };
 
 static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
 {
 	spin_lock_init(&lrus->lock);
 	INIT_LIST_HEAD(&lrus->reclaimable);
+	INIT_LIST_HEAD(&lrus->unreclaimable);
 }
 
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce the OOM path for killing an enclave when the reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with only "unreclaimable" EPC pages left in the
cgroup LRU lists. Once a victim is identified, mark the enclave as OOM,
zap the enclave's entire page range, and drain all mm references in
encl->mm_list. Block allocation of any EPC pages in the #PF handler,
reloading of any pages in all paths, and creation of any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming the last EPC pages when
OOM happens, that is, all pages other than the SECS and VA pages are in
the RECLAIM_IN_PROGRESS state. The reclaiming process requires access to
the enclave backing, the VA pages as well as the SECS. So the OOM killer
does not directly release those enclave resources; instead, it lets all
reclaiming in progress finish, and relies (as currently done) on
kref_put() on encl->refcount to trigger sgx_encl_release() to do the
final cleanup.
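
As a rough usage sketch (the caller below is hypothetical; the actual
EPC cgroup integration comes later in the series), a reclaimer that can
no longer make progress on its reclaimable list would fall back to the
OOM path:

	/* Reclaim made no progress; kill a victim to free its EPC pages. */
	if (!sgx_epc_oom(&sgx_global_lru))
		return -ENOMEM;	/* no victim could be found, give up */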

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY

V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when the enclave is marked for OOM; do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIM_IN_PROGRESS be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock
in the OOM path, which was causing deadlocks reported by lockdep.
---
 arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
 arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |   2 +
 arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
 arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
 6 files changed, 200 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_mm *encl_mm;
 
 	/*
 	 * Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
 	 * not exited yet. The processes, which have exited, are gone from the
 	 * list by sgx_mmu_notifier_release().
 	 */
-	for ( ; ; )  {
-		spin_lock(&encl->mm_lock);
-
-		if (list_empty(&encl->mm_list)) {
-			encl_mm = NULL;
-		} else {
-			encl_mm = list_first_entry(&encl->mm_list,
-						   struct sgx_encl_mm, list);
-			list_del_rcu(&encl_mm->list);
-		}
-
-		spin_unlock(&encl->mm_lock);
-
-		/* The enclave is no longer mapped by any mm. */
-		if (!encl_mm)
-			break;
-
-		synchronize_srcu(&encl->srcu);
-		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
-		kfree(encl_mm);
-
-		/* 'encl_mm' is gone, put encl_mm->encl reference: */
-		kref_put(&encl->refcount, sgx_encl_release);
-	}
-
+	sgx_encl_mm_drain(encl);
 	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index a8617e6a4b4e..3c91a705e720 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -451,6 +451,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 	if (unlikely(!encl))
 		return VM_FAULT_SIGBUS;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The page_array keeps track of all enclave pages, whether they
 	 * are swapped out or not. If there is no entry for this page and
@@ -649,7 +652,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -774,6 +778,45 @@ void sgx_encl_release(struct kref *ref)
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during OOM kill to empty the mm_list entries after they have been
+ * zapped, or by sgx_release() to drain the remaining mm_list entries when
+ * the enclave fd is closing. After this call, sgx_encl_release() will be
+ * invoked via kref_put().
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -845,6 +888,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..cdb57ecb05c8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_NO_MEMORY	= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 50ddd8988452..e1209e2cf6a3 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -420,6 +420,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
 		return -EFAULT;
 
@@ -605,6 +608,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
 		return -EFAULT;
 
@@ -681,6 +687,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
 	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..3b875ab4dcd0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -621,6 +621,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_OWNER_PAGE)
+		encl = epc_page->encl_page->encl;
+	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	/*
+	 * Use end because start can be zero and not mapped into the
+	 * enclave even if encl->base == 0.
+	 */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+		    vma->vm_start < end) {
+			zap_vma_pages(vma);
+		}
+	}
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = false;
+	int idx;
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out_put;
+
+	/* OOM was done on this enclave previously, do not redo it.
+	 * This may happen when the SECS page is still UNRECLAIMABLE because
+	 * another page is in RECLAIM_IN_PROGRESS. Still return true so the
+	 * OOM killer can wait until the reclaimer is done with the hold-up
+	 * page and SECS before it moves on to find another victim.
+	 */
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		goto out;
+
+	set_bit(SGX_ENCL_NO_MEMORY, &encl->flags);
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	sgx_encl_mm_drain(encl);
+out:
+	ret = true;
+
+out_put:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+	return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_OWNER_PAGE)
+		return sgx_oom_encl_page(victim->encl_page);
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL)
+		return sgx_oom_encl(victim->encl);
+
+	/* Will never happen unless we add more owner types in the future */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 337747bef7c2..6c0bfdc209c0 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,7 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Adjust and expose the top-level reclaim function as
sgx_reclaim_epc_pages() for use by the upcoming EPC cgroup, which will
initiate reclaim to enforce the max limit.

Make these adjustments to the function signature.

1) To take a parameter that specifies the number of pages to scan for
reclaiming. Define a max value of 32, but scan 16 in the case for the
global reclaimer (ksgxd). The EPC cgroup will use it to specify a
desired number of pages to be reclaimed up to the max value of 32.

2) To take a flag to force reclaiming a page regardless of its age.  The
EPC cgroup will use the flag to enforce its limits by draining the
reclaimable lists before resorting to other measures, e.g. forcefully
kill enclaves.

3) Return the number of reclaimed pages. The EPC cgroup will use the
result to track reclaiming progress and escalate to a more forceful
reclaiming mode, e.g., calling this function with the flag to ignore the
age of pages, as sketched below.
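
A minimal sketch of how a caller might use the adjusted signature to
escalate (example_reclaim() is a hypothetical helper; only
sgx_reclaim_epc_pages() and SGX_NR_TO_SCAN come from this patch):

	/*
	 * Sketch: retry reclaim, escalating to ignore page age after
	 * several unproductive passes, and bail out when even forced
	 * reclaim makes no progress.
	 */
	static void example_reclaim(size_t nr_needed)
	{
		bool ignore_age = false;
		int nr_fails = 0;

		while (nr_needed) {
			size_t nr = sgx_reclaim_epc_pages(SGX_NR_TO_SCAN,
							  ignore_age);

			if (!nr) {
				if (ignore_age)
					break;	/* nothing reclaimable */
				if (++nr_fails > 3)
					ignore_age = true;
				continue;
			}
			nr_needed -= min(nr, nr_needed);
		}
	}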

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- Combined the 3 patches that made the individual changes to the
function signature.
- Removed 'high' limit in commit message.
---
 arch/x86/kernel/cpu/sgx/main.c | 31 +++++++++++++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
 2 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 3b875ab4dcd0..4e1a3e038db5 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -18,6 +18,11 @@
 #include "encl.h"
 #include "encls.h"
 
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX	32
+
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -279,7 +284,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
+/**
+ * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
+ * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ * @ignore_age:		 Reclaim a page even if it is young
+ *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
  * been accessed since the last scan. Move those pages to the tail of active
@@ -292,15 +301,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
-	int ret;
-	int i;
+	size_t ret, i;
 
 	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
@@ -326,13 +334,14 @@ static void sgx_reclaim_pages(void)
 	spin_unlock(&sgx_global_lru.lock);
 
 	if (list_empty(&iso))
-		return;
+		return 0;
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX ||
+		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -371,6 +380,8 @@ static void sgx_reclaim_pages(void)
 
 		sgx_free_epc_page(epc_page);
 	}
+
+	return i;
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,7 +398,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 }
 
 static int ksgxd(void *p)
@@ -410,7 +421,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 
 		cond_resched();
 	}
@@ -582,7 +593,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 		cond_resched();
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 6c0bfdc209c0..7e7f1f36d31e 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -179,6 +179,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the isolation loop into a helper, sgx_isolate_epc_pages(), in
preparation for the existence of multiple LRUs. Expose the helper to
other SGX code so that it can be called from the EPC cgroup code, e.g.,
to isolate pages from a single cgroup LRU. Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.
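
As an illustration, the calling convention mirrors how the reclaimer in
this patch uses the helper (fragment only; sgx_global_lru is file-local
to main.c and is shown here just to demonstrate the call):

	LIST_HEAD(iso);

	/* Pull up to SGX_NR_TO_SCAN candidates off the reclaimable list. */
	sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);

	if (list_empty(&iso))
		return 0;	/* nothing could be isolated */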

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V4:
- No changes other than reordering the patches
---
 arch/x86/kernel/cpu/sgx/main.c | 57 +++++++++++++++++++++-------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4e1a3e038db5..b34ad3574c81 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -284,6 +284,40 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+			   struct list_head *dst)
+{
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
+		if (!epc_page)
+			break;
+
+		encl_page = epc_page->encl_page;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
+			list_move_tail(&epc_page->list, dst);
+		} else {
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
+			 */
+			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
+		}
+	}
+	spin_unlock(&lru->lock);
+}
+
 /**
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
@@ -310,28 +344,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	LIST_HEAD(iso);
 	size_t ret, i;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
-		if (!epc_page)
-			break;
-
-		list_del_init(&epc_page->list);
-		encl_page = epc_page->encl_page;
-
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
-			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
-		} else {
-			/* The owner is freeing the page, remove it from the
-			 * LRU list
-			 */
-			sgx_epc_page_reset_state(epc_page);
-			list_del_init(&epc_page->list);
-		}
-	}
-	spin_unlock(&sgx_global_lru.lock);
+	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 7e7f1f36d31e..42075762084c 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -180,6 +180,8 @@ int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add wrappers for the direct references to the global LRU list in the
reclaimer functions. To support multiple LRU lists (one per EPC
cgroup) later, changes will only be needed inside these wrappers.
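
For example, once per-cgroup LRUs exist (added later in this series),
only the wrapper body changes while every call site stays intact:

	static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
	{
		if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
			return epc_cg_lru(epc_page->epc_cg);

		return &sgx_global_lru;
	}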

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Revise commit message to make the purpose more clear.

V4:
- Re-organized this patch to include all changes related to
encapsulation of the global LRU
- Moved this patch to precede the EPC cgroup patch
---
 arch/x86/kernel/cpu/sgx/main.c | 41 +++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b34ad3574c81..d37ef0dd865f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -35,6 +35,16 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_lists sgx_global_lru;
 
+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
+static inline bool sgx_can_reclaim(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -340,6 +350,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
+	struct sgx_epc_lru_lists *lru;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
 	size_t ret, i;
@@ -372,10 +383,11 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 		continue;
 
 skip:
-		spin_lock(&sgx_global_lru.lock);
+		lru = sgx_lru_lists(epc_page);
+		spin_lock(&lru->lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
-		spin_unlock(&sgx_global_lru.lock);
+		list_move_tail(&epc_page->list, &lru->reclaimable);
+		spin_unlock(&lru->lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
@@ -400,7 +412,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+		sgx_can_reclaim();
 }
 
 /*
@@ -530,14 +542,16 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
 	if (sgx_epc_page_reclaimable(flags))
-		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+		list_add_tail(&page->list, &lru->reclaimable);
 	else
-		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+		list_add_tail(&page->list, &lru->unreclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
@@ -552,15 +566,16 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
  */
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
-		spin_unlock(&sgx_global_lru.lock);
+		spin_unlock(&lru->lock);
 		return -EBUSY;
 	}
-
 	list_del(&page->list);
 	sgx_epc_page_reset_state(page);
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	return 0;
 }
@@ -593,7 +608,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_global_lru.reclaimable))
+		if (!sgx_can_reclaim())
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>

Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- kernel-doc fixes (Jarkko)

V4:
- Fix a white space issue in Kconfig (Randy).
- Update comments for LRU list as it can be owned by a cgroup.
- Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently (Mikko)

V3:

1) Use the same maximum number of reclaiming candidate pages to be
processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
the cgroup worker function and ksgxd. This fixes an overflow of the
fixed-size backing store buffer allocated on the stack in
sgx_reclaim_epc_pages().

2) Initialize max for the root EPC cgroup. Otherwise, all
misc_cg_try_charge() calls would fail as it checks the limits of all
ancestors all the way up to the root node.

3) Start reclaiming whenever misc_cg_try_charge() fails. Removed all
re-checks for limits and current usage. For all intents and purposes,
when misc_cg_try_charge() fails, reclaiming is needed. This also corrects
an error of not reclaiming when the child limit is larger than that of
one of its ancestors.

4) Handle failure on charging to the root EPC cgroup. Failure on charging
to root means we are at or above capacity, so start reclaiming or return
OOM error.

5) Removed the custom cgroup tree walking iterator with epoch tracking
logic. Replaced it with just the plain css_for_each_descendant_pre
iterator. The custom iterator implemented a rather complex epoch scheme
I believe was intended to prevent extra reclaiming from multiple worker
threads doing the same walk, but it turned out not to matter much as
each thread would only reclaim when usage is above the limit. Using the
plain css_for_each_descendant_pre iterator simplified the code a bit.

6) Do not reclaim synchronously in the misc_max_write callback, which
would block the user. Instead, queue an async work item to run the
reclaiming loop.

7) Other minor refactoring:
- Remove unused params in epc_cgroup APIs
- Centralize uncharge into sgx_free_epc_page()
---
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
 arch/x86/kernel/cpu/sgx/main.c       |  68 ++++-
 arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
 6 files changed, 556 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..e17c5dc3aea4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+	depends on X86_SGX && CGROUP_MISC
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup via
+	  the Miscellaneous cgroup controller.
+
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config X86_USER_SHADOW_STACK
 	bool "X86 userspace shadow stack"
 	depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..b5da89cf3a4c
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
+#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
+
+struct sgx_epc_reclaim_control {
+	struct sgx_epc_cgroup *epc_cg;
+	int nr_fails;
+	bool ignore_age;
+};
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+	return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lowest limit among a cgroup and all of its ancestors.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+	struct misc_cg *i = epc_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	if (cg)
+		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:	root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+	bool ret = true;
+
+	/*
+	 * The caller must ensure a reference to css_root is held.
+	 */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&epc_cg->lru.lock);
+		ret = list_empty(&epc_cg->lru.reclaimable);
+		spin_unlock(&epc_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages() - walk a cgroup tree and isolate pages
+ * @root:	root of the tree to start walking
+ * @nr_to_scan: The number of pages that need to be isolated
+ * @dst:	Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+
+	if (!*nr_to_scan)
+		return;
+
+	/* The caller must ensure a reference to css_root is held. */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!*nr_to_scan)
+			break;
+	}
+
+	rcu_read_unlock();
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+					struct sgx_epc_reclaim_control *rc)
+{
+	/*
+	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and maximum
+	 * number of pages.  Attempting to reclaim only a few pages will
+	 * often fail and is inefficient, while reclaiming a huge number
+	 * of pages can result in soft lockups due to holding various
+	 * locks for an extended duration.
+	 */
+	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
+
+	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+		return -ENOMEM;
+
+	++rc->nr_fails;
+	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+		rc->ignore_age = true;
+
+	return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+				  struct sgx_epc_cgroup *epc_cg)
+{
+	rc->epc_cg = epc_cg;
+	rc->nr_fails = 0;
+	rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	u64 cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge
+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.  Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_NR_TO_SCAN_MAX)
+			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
+
+		max = min(max, sgx_epc_total_pages);
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+		/* Nothing reclaimable */
+		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
+			if (!sgx_epc_cgroup_oom(epc_cg))
+				break;
+
+			continue;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc))
+				break;
+		}
+	}
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+				       bool reclaim)
+{
+	struct sgx_epc_reclaim_control rc;
+	unsigned int nr_empty = 0;
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					return -ENOMEM;
+				schedule();
+			}
+		}
+	}
+	if (epc_cg->cg != misc_cg_root())
+		css_get(&epc_cg->cg->css);
+
+	return 0;
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
+ * @reclaim:		whether or not synchronous reclaim is allowed
+ *
+ * Return: the charged EPC cgroup on success (NULL if the misc controller is
+ * disabled), or an ERR_PTR-encoded -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	int ret;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
+	put_misc_cg(epc_cg->cg);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (epc_cg->cg != misc_cg_root())
+		put_misc_cg(epc_cg->cg);
+}
+
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+	bool oom = false;
+
+	/* The caller must ensure a reference to css_root is held. */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		/* skip dead ones */
+		if (!css_tryget(pos))
+			continue;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		oom = sgx_epc_oom(&epc_cg->lru);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (oom)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return oom;
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+	/* Let the reclaimer do the work so the user is not blocked */
+	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+	if (!epc_cg)
+		return -ENOMEM;
+
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+	cg->res[MISC_CG_RES_SGX_EPC].alloc = sgx_epc_cgroup_alloc;
+	cg->res[MISC_CG_RES_SGX_EPC].free = sgx_epc_cgroup_free;
+	cg->res[MISC_CG_RES_SGX_EPC].max_write = sgx_epc_cgroup_max_write;
+	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+	epc_cg->cg = cg;
+
+	return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+	struct misc_cg *cg;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);
+
+	cg = misc_cg_root();
+	BUG_ON(!cg);
+
+	return sgx_epc_cgroup_alloc(cg);
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..dfc902f4d96f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	return NULL;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+						size_t *nr_to_scan,
+						struct list_head *dst) { }
+
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	return true;
+}
+#else
+struct sgx_epc_cgroup {
+	struct misc_cg *cg;
+	struct sgx_epc_lru_lists	lru;
+	struct work_struct	reclaim_work;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	if (epc_cg)
+		return &epc_cg->lru;
+	return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d37ef0dd865f..0ade7792ff5f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,12 +18,9 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
-/*
- * Maximum number of pages to scan for reclaiming.
- */
-#define SGX_NR_TO_SCAN_MAX	32
-
+u64 sgx_epc_total_pages;
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -37,11 +35,17 @@ static struct sgx_epc_lru_lists sgx_global_lru;
 
 static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return epc_cg_lru(epc_page->epc_cg);
+
 	return &sgx_global_lru;
 }
 
 static inline bool sgx_can_reclaim(void)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return !sgx_epc_cgroup_lru_empty(NULL);
+
 	return !list_empty(&sgx_global_lru.reclaimable);
 }
 
@@ -300,14 +304,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * @nr_to_scan:	Number of pages to scan for reclaim
  * @dst:	Destination list to hold the isolated pages
  */
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
 			   struct list_head *dst)
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
 
 	spin_lock(&lru->lock);
-	for (; nr_to_scan > 0; --nr_to_scan) {
+	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
 		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
@@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
+ * @epc_cg:		 EPC cgroup from which to reclaim
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	LIST_HEAD(iso);
 	size_t ret, i;
 
-	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
+	/*
+	 * If a specific cgroup is not being targeted, take from the global
+	 * list first, even when cgroups are enabled.  If there are
+	 * pages on the global LRU then they should get reclaimed asap.
+	 */
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
@@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 }
 
 static int ksgxd(void *p)
@@ -446,7 +460,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 
 		cond_resched();
 	}
@@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 {
 	struct sgx_epc_page *page;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
+	if (IS_ERR(epc_cg))
+		return ERR_CAST(epc_cg);
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (!sgx_can_reclaim())
-			return ERR_PTR(-ENOMEM);
+		if (!sgx_can_reclaim()) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (!reclaim) {
 			page = ERR_PTR(-EBUSY);
@@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(page->epc_cg);
+		page->epc_cg = epc_cg;
+	} else {
+		sgx_epc_cgroup_uncharge(epc_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -647,6 +675,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
 
+	if (page->epc_cg) {
+		sgx_epc_cgroup_uncharge(page->epc_cg);
+		page->epc_cg = NULL;
+	}
+
 	spin_lock(&node->lock);
 
 	page->encl_page = NULL;
@@ -657,6 +690,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
+
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
@@ -826,6 +860,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
+		section->pages[i].epc_cg = NULL;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -970,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -1020,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -1029,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 42075762084c..1b90a905a9e2 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX		32UL
 #define SGX_NR_TO_SCAN			16
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
@@ -70,6 +75,8 @@ enum sgx_epc_page_state {
 /* flag for pages owned by a sgx_encl struct */
 #define SGX_EPC_OWNER_ENCL		BIT(4)
 
+struct sgx_epc_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -81,6 +88,7 @@ struct sgx_epc_page {
 		struct sgx_encl *encl;
 	};
 	struct list_head list;
+	struct sgx_epc_cgroup *epc_cg;
 };
 
 static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
@@ -129,6 +137,7 @@ struct sgx_epc_section {
 	struct sgx_numa_node *node;
 };
 
+extern u64 sgx_epc_total_pages;
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 
 static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
@@ -152,7 +161,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 }
 
 /*
- * Contains EPC pages tracked by the reclaimer (ksgxd).
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
  */
 struct sgx_epc_lru_lists {
 	spinlock_t lock;
@@ -179,8 +189,9 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
 			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A,
	dave.hansen-VuQAYsv1563Yd54FQh9/CA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit on EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.

Co-developed-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Co-developed-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Tested-by: Mikko Ylinen <mikko.ylinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V5:
- kernel-doc fixes (Jarkko)

V4:
- Fix a white space issue in Kconfig (Randy).
- Update comments for LRU list as it can be owned by a cgroup.
- Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently (Mikko)

V3:

1) Use the same maximum number of reclaiming candidate pages to be
processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
the cgroup worker function and ksgxd. This fixes an overflow of the
fixed-size backing store buffer allocated on the stack in
sgx_reclaim_epc_pages().

2) Initialize max for the root EPC cgroup. Otherwise, all
misc_cg_try_charge() calls would fail, as it checks the limits of all
ancestors up to the root node.

3) Start reclaiming whenever misc_cg_try_charge() fails. Removed all
re-checks for limits and current usage. For all intents and purposes,
when misc_cg_try_charge() fails, reclaiming is needed. This also corrects
an error of not reclaiming when the child's limit is larger than that of
one of its ancestors.

4) Handle failure on charging to the root EPC cgroup. Failure on charging
to root means we are at or above capacity, so start reclaiming or return
OOM error.

5) Removed the custom cgroup tree walking iterator with epoch tracking
logic. Replaced it with just the plain css_for_each_descendant_pre
iterator. The custom iterator implemented a rather complex epoch scheme,
which I believe was intended to prevent extra reclaiming from multiple
worker threads doing the same walk, but it turned out not to matter much
as each thread only reclaims when usage is above the limit. Using the
plain css_for_each_descendant_pre iterator simplified the code a bit.

6) Do not reclaim synchronously in the misc_max_write callback, which
would block the user. Instead, queue an async work item to run the
reclaiming loop.

7) Other minor refactoring:
- Remove unused params in epc_cgroup APIs
- centralize uncharge into sgx_free_epc_page()
---
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
 arch/x86/kernel/cpu/sgx/main.c       |  68 ++++-
 arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
 6 files changed, 556 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..e17c5dc3aea4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+	depends on X86_SGX && CGROUP_MISC
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup via
+	  the Miscellaneous cgroup controller.
+
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config X86_USER_SHADOW_STACK
 	bool "X86 userspace shadow stack"
 	depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..b5da89cf3a4c
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
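+/*
+ * Reclaim tunables: the minimum batch of pages reclaimed per attempt,
+ * the number of failed attempts after which page age is ignored, and
+ * the number of empty-LRU rounds after which a charge attempt gives up
+ * with -ENOMEM.
+ */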
+#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
+
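+/*
+ * Per reclaim-run state: the cgroup being reclaimed from, the number of
+ * failed attempts so far, and whether page age should be ignored on
+ * subsequent attempts.
+ */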
+struct sgx_epc_reclaim_control {
+	struct sgx_epc_cgroup *epc_cg;
+	int nr_fails;
+	bool ignore_age;
+};
+
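+/*
+ * The misc controller tracks usage and max in bytes; the helpers below
+ * convert them to EPC page counts for the reclaim logic.
+ */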
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+	return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+	struct misc_cg *i = epc_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	if (cg)
+		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:	root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+	bool ret = true;
+
+	/*
+	 * The caller must ensure a reference to css_root is held.
+	 */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&epc_cg->lru.lock);
+		ret = list_empty(&epc_cg->lru.reclaimable);
+		spin_unlock(&epc_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages() - walk a cgroup tree and separate pages
+ * @root:	root of the tree to start walking
+ * @nr_to_scan: Pointer to the number of pages to isolate; decremented per isolated page
+ * @dst:	Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+
+	if (!*nr_to_scan)
+		return;
+
+	/* The caller must ensure a reference to css_root is held. */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!*nr_to_scan)
+			break;
+	}
+
+	rcu_read_unlock();
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+					struct sgx_epc_reclaim_control *rc)
+{
+	/*
+	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and maximum
+	 * number of pages.  Attempting to reclaim only a few pages will
+	 * often fail and is inefficient, while reclaiming a huge number
+	 * of pages can result in soft lockups due to holding various
+	 * locks for an extended duration.
+	 */
+	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
+
+	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+		return -ENOMEM;
+
+	++rc->nr_fails;
+	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+		rc->ignore_age = true;
+
+	return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+				  struct sgx_epc_cgroup *epc_cg)
+{
+	rc->epc_cg = epc_cg;
+	rc->nr_fails = 0;
+	rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity.
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	u64 cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page; the goal is to free up
+		 * pages for fault allocations, not simply to obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge()
+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.  Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_NR_TO_SCAN_MAX)
+			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
+
+		max = min(max, sgx_epc_total_pages);
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+		/* Nothing reclaimable */
+		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
+			if (!sgx_epc_cgroup_oom(epc_cg))
+				break;
+
+			continue;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc))
+				break;
+		}
+	}
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+				       bool reclaim)
+{
+	struct sgx_epc_reclaim_control rc;
+	unsigned int nr_empty = 0;
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					return -ENOMEM;
+				schedule();
+			}
+		}
+	}
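+
+	/*
+	 * Pin a non-root cgroup while pages are charged to it; the
+	 * reference is dropped in sgx_epc_cgroup_uncharge().  The root
+	 * cgroup is never freed and needs no reference.
+	 */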
+	if (epc_cg->cg != misc_cg_root())
+		css_get(&epc_cg->cg->css);
+
+	return 0;
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
+ * @reclaim:		whether or not synchronous reclaim is allowed
+ *
+ * Return: a pointer to the charged EPC cgroup (or NULL if the misc
+ * controller is disabled) on success, or an ERR_PTR-encoded -errno on
+ * failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	int ret;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
+	put_misc_cg(epc_cg->cg);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (epc_cg->cg != misc_cg_root())
+		put_misc_cg(epc_cg->cg);
+}
+
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_epc_cgroup *epc_cg;
+	bool oom = false;
+
+	/* The caller must ensure a reference to css_root is held. */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		/* skip dead ones */
+		if (!css_tryget(pos))
+			continue;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		oom = sgx_epc_oom(&epc_cg->lru);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (oom)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return oom;
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+	/* Let the reclaimer do the work so the user is not blocked. */
+	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+	if (!epc_cg)
+		return -ENOMEM;
+
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
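+	/*
+	 * Register the per-resource callbacks so the misc controller
+	 * invokes this code on css allocation/free and on misc.max writes.
+	 */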
+	cg->res[MISC_CG_RES_SGX_EPC].alloc = sgx_epc_cgroup_alloc;
+	cg->res[MISC_CG_RES_SGX_EPC].free = sgx_epc_cgroup_free;
+	cg->res[MISC_CG_RES_SGX_EPC].max_write = sgx_epc_cgroup_max_write;
+	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+	epc_cg->cg = cg;
+
+	return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+	struct misc_cg *cg;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);
+
+	cg = misc_cg_root();
+	BUG_ON(!cg);
+
+	return sgx_epc_cgroup_alloc(cg);
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..dfc902f4d96f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	return NULL;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+						size_t *nr_to_scan,
+						struct list_head *dst) { }
+
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	return true;
+}
+#else
+struct sgx_epc_cgroup {
+	struct misc_cg			*cg;
+	struct sgx_epc_lru_lists	lru;
+	struct work_struct		reclaim_work;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	if (epc_cg)
+		return &epc_cg->lru;
+	return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d37ef0dd865f..0ade7792ff5f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,12 +18,9 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
-/*
- * Maximum number of pages to scan for reclaiming.
- */
-#define SGX_NR_TO_SCAN_MAX	32
-
+u64 sgx_epc_total_pages;
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -37,11 +35,17 @@ static struct sgx_epc_lru_lists sgx_global_lru;
 
 static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return epc_cg_lru(epc_page->epc_cg);
+
 	return &sgx_global_lru;
 }
 
 static inline bool sgx_can_reclaim(void)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return !sgx_epc_cgroup_lru_empty(NULL);
+
 	return !list_empty(&sgx_global_lru.reclaimable);
 }
 
@@ -300,14 +304,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * @nr_to_scan:	Number of pages to scan for reclaim
  * @dst:	Destination list to hold the isolated pages
  */
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
 			   struct list_head *dst)
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
 
 	spin_lock(&lru->lock);
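+	/*
+	 * @nr_to_scan is decremented in place so the caller can continue
+	 * a scan across multiple LRUs with the remaining budget.
+	 */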
-	for (; nr_to_scan > 0; --nr_to_scan) {
+	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
 		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
@@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
+ * @epc_cg:		 EPC cgroup from which to reclaim
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	LIST_HEAD(iso);
 	size_t ret, i;
 
-	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
+	/*
+	 * If a specific cgroup is not being targeted, take from the global
+	 * list first, even when cgroups are enabled.  Any pages on the
+	 * global LRU should be reclaimed as soon as possible.
+	 */
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
@@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 }
 
 static int ksgxd(void *p)
@@ -446,7 +460,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 
 		cond_resched();
 	}
@@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 {
 	struct sgx_epc_page *page;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
+	if (IS_ERR(epc_cg))
+		return ERR_CAST(epc_cg);
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (!sgx_can_reclaim())
-			return ERR_PTR(-ENOMEM);
+		if (!sgx_can_reclaim()) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (!reclaim) {
 			page = ERR_PTR(-EBUSY);
@@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(page->epc_cg);
+		page->epc_cg = epc_cg;
+	} else {
+		sgx_epc_cgroup_uncharge(epc_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -647,6 +675,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
 
+	if (page->epc_cg) {
+		sgx_epc_cgroup_uncharge(page->epc_cg);
+		page->epc_cg = NULL;
+	}
+
 	spin_lock(&node->lock);
 
 	page->encl_page = NULL;
@@ -657,6 +690,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
+
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
@@ -826,6 +860,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
+		section->pages[i].epc_cg = NULL;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -970,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -1020,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -1029,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 42075762084c..1b90a905a9e2 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX		32UL
 #define SGX_NR_TO_SCAN			16
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
@@ -70,6 +75,8 @@ enum sgx_epc_page_state {
 /* flag for pages owned by a sgx_encl struct */
 #define SGX_EPC_OWNER_ENCL		BIT(4)
 
+struct sgx_epc_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -81,6 +88,7 @@ struct sgx_epc_page {
 		struct sgx_encl *encl;
 	};
 	struct list_head list;
+	struct sgx_epc_cgroup *epc_cg;
 };
 
 static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
@@ -129,6 +137,7 @@ struct sgx_epc_section {
 	struct sgx_numa_node *node;
 };
 
+extern u64 sgx_epc_total_pages;
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 
 static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
@@ -152,7 +161,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 }
 
 /*
- * Contains EPC pages tracked by the reclaimer (ksgxd).
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
  */
 struct sgx_epc_lru_lists {
 	spinlock_t lock;
@@ -179,8 +189,9 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
 			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 17/18] Docs/x86/sgx: Add description for cgroup support
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.
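
As a quick sketch of how the resulting files might be inspected from a
shell (cgroup names and byte values are illustrative only):

  # cat /sys/fs/cgroup/my_pod/misc.current
  sgx_epc 8388608
  # cat /sys/fs/cgroup/my_pod/misc.max
  sgx_epc 134217728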

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 82 ++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..65c211bd5342 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,85 @@ to expected failures and handle them as follows:
    first call.  It indicates a bug in the kernel or the userspace client
    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
    a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that
+is used to provide SGX-enabled applications with protected memory,
+and is otherwise inaccessible, i.e. shows up as reserved in
+/proc/iomem and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM,
+for all intents and purposes the EPC is independent from normal system
+memory, e.g. must be reserved at boot from RAM and cannot be converted
+between EPC and normal memory while the system is running.  The EPC is
+managed by the SGX subsystem and is not accounted by the memory
+controller.  Note that this is true only for EPC memory itself, i.e.
+normal memory allocations related to SGX and EPC memory, e.g. the
+backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via
+virtual memory techniques and pages can be swapped out of the EPC
+to their backing store (normal system memory allocated via shmem).
+The SGX EPC subsystem is analogous to the memory subsystem, and
+it implements limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface
+files, please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated
+otherwise.  If a value which is not PAGE_SIZE aligned is written,
+the actual value used by the controller will be rounded down to
+the closest PAGE_SIZE multiple.
+
+  misc.capacity
+        A read-only flat-keyed file shown only in the root cgroup.
+        The sgx_epc resource will show the total amount of EPC
+        memory available on the platform.
+
+  misc.current
+        A read-only flat-keyed file shown in the non-root cgroups.
+        The sgx_epc resource will show the current active EPC memory
+        usage of the cgroup and its descendants. EPC pages that are
+        swapped out to backing RAM are not included in the current count.
+
+  misc.max
+        A read-write single value file which exists on non-root
+        cgroups. The sgx_epc resource will show the EPC usage
+        hard limit. The default is "max".
+
+        If a cgroup's EPC usage reaches this limit, EPC allocations,
+        e.g. for page fault handling, will be blocked until EPC can
+        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
+        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
+
+        The EPC pages allocated for KVM guests by the virtual EPC driver
+        are not reclaimable by the host kernel SGX reclaimers. If a VMM
+        tries to start a VM within a cgroup whose EPC usage reaches this
+        limit, the virtual EPC driver will stop allocating more EPC for the
+        VM, and return SIGBUS to the VMM which would abort the VM launch.
+
+  misc.events
+        A read-only flat-keyed file which exists on non-root cgroups.
+        A value change in this file generates a file modified event.
+
+          max
+                The number of times the cgroup has triggered a reclaim
+                due to its EPC usage approaching (or exceeding) its max
+                EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH v5 18/18] selftests/sgx: Add scripts for EPC cgroup testing
@ 2023-09-23  3:06   ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-09-23  3:06 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

The scripts rely on cgroup-tools package from libcgroup [1].

To run the selftests for the EPC cgroup:

sudo ./run_epc_cg_selftests.sh

With different cgroups, the script starts one or more concurrent SGX
selftests, each running one unclobbered_vdso_oversubscribed test.
Each such test tries to load an enclave of a size equal to the EPC
capacity available on the platform. The script checks the results
against the expectation set for each cgroup and reports success or
failure.

The script creates 3 different cgroups at the beginning with the
following expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4, expecting at
least one test to pass, and then starts 5, expecting at least one test
to fail.
3) LARGER - limit is the same as capacity, large enough to run lots of
concurrent tests. The script starts 10 of them and expects all to pass.
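
As a manual variant of what the script automates, a single test can be
pinned to a cgroup with cgroup-tools (names, the limit value, and the
v1 mount path are illustrative only):

  cgcreate -g misc:mytest
  echo "sgx_epc 65536" > /sys/fs/cgroup/misc/mytest/misc.max
  cgexec -g misc:mytest ./test_sgx -t unclobbered_vdso_oversubscribed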

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

./watch_misc_for_tests.sh current

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V5:

- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.

V4:

Note: Need to apply on top of this series previously reviewed:
https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko@kernel.org/
---
 .../selftests/sgx/run_epc_cg_selftests.sh     | 147 ++++++++++++++++++
 .../selftests/sgx/watch_misc_for_tests.sh     |  13 ++
 2 files changed, 160 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..410c97ee6e18
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,147 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+    echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+    exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_ROOT=/sys/fs/cgroup
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+    echo "# cgroup V2 is in use."
+else
+    echo "# cgroup V1 is in use."
+    CG_ROOT=/sys/fs/cgroup/misc
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
+# This is below the number of VA pages needed for an enclave of capacity
+# size, so the oversubscribed cases should fail.
+SMALL=$(( CAPACITY / 512 ))
+
+# Large enough to load at least one enclave of capacity size, maybe up to 4.
+# Some tests may fail if more than 4 concurrent enclaves of capacity size run.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" | tee $CG_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" | tee $CG_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" | tee $CG_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
+# Always use leaf node of misc cgroups so it works for both v1 and v2
+# these may fail on OOM
+cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [[ $? -eq 0 ]]; then
+    echo "# Failed on SMALL limit: no test was expected to pass."
+    cgdelete -r -g misc:$TEST_ROOT_CG
+    exit 1
+else
+    echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit, expecting at least one success...."
+pids=()
+for i in {1..4}; do
+    (
+        cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+    ) &
+    pids+=($!)
+done
+
+any_success=0
+for pid in "${pids[@]}"; do
+    wait "$pid"
+    status=$?
+    if [[ $status -eq 0 ]]; then
+        any_success=1
+        echo "# Process $pid returned successfully."
+    fi
+done
+
+if [[ $any_success -eq 0 ]]; then
+    echo "# Failed on LARGE limit positive testing, no test passes."
+    cgdelete -r -g misc:$TEST_ROOT_CG
+    exit 1
+fi
+
+echo "# PASSED LARGE limit positive testing."
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit, expecting at least one failure...."
+pids=()
+for i in {1..5}; do
+    (
+        cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+    ) &
+    pids+=($!)
+done
+
+any_failure=0
+for pid in "${pids[@]}"; do
+    wait "$pid"
+    status=$?
+    if [[ $status -ne 0 ]]; then
+        echo "# Process $pid returned failure."
+        any_failure=1
+    fi
+done
+
+if [[ $any_failure -eq 0 ]]; then
+    echo "# Failed on LARGE limit negative testing, no test fails."
+    cgdelete -r -g misc:$TEST_ROOT_CG
+    exit 1
+fi
+
+echo "# PASSED LARGE limit negative testing."
+
+echo "# Start 10 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit, expecting no failure...."
+pids=()
+for i in {1..10}; do
+    (
+        cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+    ) &
+    pids+=($!)
+done
+
+any_failure=0
+for pid in "${pids[@]}"; do
+    wait "$pid"
+    status=$?
+    if [[ $status -ne 0 ]]; then
+        echo "# Process $pid returned failure."
+        any_failure=1
+    fi
+done
+
+if [[ $any_failure -ne 0 ]]; then
+    echo "# Failed on LARGER limit, at least one test fails."
+    cgdelete -r -g misc:$TEST_ROOT_CG
+    exit 1
+fi
+
+echo "# PASSED LARGER limit tests."
+
+echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
+cgdelete -r -g misc:$TEST_ROOT_CG
+echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+  then
+    echo "No argument supplied, please provide 'max', 'current' or 'events'"
+    exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
+    'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
@ 2023-09-25 17:09     ` Jarkko Sakkinen
  0 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-25 17:09 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed, or in event of user writing the max value to
> the misc.max file to set the usage limit of a specific resource
> [admin-guide/cgroup-v2.rst, 5-9. Misc].
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver:
> - On css_alloc, allocate and initialize necessary structures for EPC
> reclaiming, e.g., LRU list, work queue, etc.
> - On css_free, cleanup and free those structures created in alloc.
> - On max_write, trigger EPC reclaiming if the new limit is at or below
> current usage.
>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V5:
> - Remove prefixes from the callback names (tj)
> - Update commit message (Jarkko)
>
> V4:
> - Moved this to the front of the series.
> - Applies on cgroup/for-6.6 with the overflow fix for misc.
>
> V3:
> - Removed the released() callback
> ---
>  include/linux/misc_cgroup.h |  5 +++++
>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e799b1f8d05b..96a88822815a 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -37,6 +37,11 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +
> +	/* per resource callback ops */
> +	int (*alloc)(struct misc_cg *cg);
> +	void (*free)(struct misc_cg *cg);
> +	void (*max_write)(struct misc_cg *cg);
>  };
>  
>  /**
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 79a3717a5803..62c9198dee21 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>  
>  	cg = css_misc(of_css(of));
>  
> -	if (READ_ONCE(misc_res_capacity[type]))
> +	if (READ_ONCE(misc_res_capacity[type])) {
>  		WRITE_ONCE(cg->res[type].max, max);
> -	else
> +		if (cg->res[type].max_write)
> +			cg->res[type].max_write(cg);
> +	} else {
>  		ret = -EINVAL;
> +	}
>  
>  	return ret ? ret : nbytes;
>  }
> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> +	struct misc_cg *parent_cg;
>  	enum misc_res_type i;
>  	struct misc_cg *cg;
> +	int ret;
>  
>  	if (!parent_css) {
>  		cg = &root_cg;
> +		parent_cg = &root_cg;
>  	} else {
>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>  		if (!cg)
>  			return ERR_PTR(-ENOMEM);
> +		parent_cg = css_misc(parent_css);
>  	}
>  
>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
>  		atomic64_set(&cg->res[i].usage, 0);
> +		if (parent_cg->res[i].alloc) {
> +			ret = parent_cg->res[i].alloc(cg);
> +			if (ret)
> +				goto alloc_err;
> +		}
>  	}
>  
>  	return &cg->css;
> +
> +alloc_err:
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (parent_cg->res[i].free)
> +			cg->res[i].free(cg);
> +	kfree(cg);
> +	return ERR_PTR(ret);
>  }
>  
>  /**
> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>   */
>  static void misc_cg_free(struct cgroup_subsys_state *css)
>  {
> -	kfree(css_misc(css));
> +	struct misc_cg *cg = css_misc(css);
> +	enum misc_res_type i;
> +
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (cg->res[i].free)
> +			cg->res[i].free(cg);
> +
> +	kfree(cg);
>  }
>  
>  /* Cgroup controller callbacks */
> -- 
> 2.25.1

Since the only existing client feature requires all callbacks, should
this not have that as an invariant?

I.e. it might be better to fail unless *all* ops are non-NULL (e.g. to
catch issues in the kernel code).
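
E.g. a registration-time check would make that invariant explicit. A
minimal sketch (misc_cg_set_res_ops() is a hypothetical helper, not
something this series adds):

	/* Hypothetical: register per-resource ops, all three required. */
	static int misc_cg_set_res_ops(struct misc_res *res,
				       int (*alloc)(struct misc_cg *),
				       void (*free)(struct misc_cg *),
				       void (*max_write)(struct misc_cg *))
	{
		if (!alloc || !free || !max_write)
			return -EINVAL;

		res->alloc = alloc;
		res->free = free;
		res->max_write = max_write;
		return 0;
	}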

BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
@ 2023-09-25 17:11     ` Jarkko Sakkinen
  0 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-25 17:11 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> Use the lower 3 bits in the flags field of the sgx_epc_page struct to
> track EPC states in its life cycle and define an enum for possible
> states. More state(s) will be added later.
>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V4:
> - No changes other than required for patch reordering.
>
> V3:
> - This is new in V3 to replace the bit mask based approach (requested by Jarkko)
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++---
>  arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++--
>  arch/x86/kernel/cpu/sgx/main.c  | 19 +++++++------
>  arch/x86/kernel/cpu/sgx/sgx.h   | 49 ++++++++++++++++++++++++++++++---
>  4 files changed, 71 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 97a53e34a8b4..f5afc8d65e22 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -244,8 +244,12 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
>  {
>  	struct sgx_epc_page *epc_page = encl->secs.epc_page;
>  
> -	if (!epc_page)
> +	if (!epc_page) {
>  		epc_page = sgx_encl_eldu(&encl->secs, NULL);
> +		if (!IS_ERR(epc_page))
> +			sgx_record_epc_page(epc_page,
> +					    SGX_EPC_PAGE_UNRECLAIMABLE);

Can be a single line probably (less than 100 characters).
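
I.e. something like:

		if (!IS_ERR(epc_page))
			sgx_record_epc_page(epc_page, SGX_EPC_PAGE_UNRECLAIMABLE);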

> +	}
>  
>  	return epc_page;
>  }
> @@ -272,7 +276,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  		return ERR_CAST(epc_page);
>  
>  	encl->secs_child_cnt++;
> -	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  
>  	return entry;
>  }
> @@ -398,7 +402,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	encl_page->type = SGX_PAGE_TYPE_REG;
>  	encl->secs_child_cnt++;
>  
> -	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  
>  	phys_addr = sgx_get_epc_phys_addr(epc_page);
>  	/*
> @@ -1256,6 +1260,8 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
>  		sgx_encl_free_epc_page(epc_page);
>  		return ERR_PTR(-EFAULT);
>  	}
> +	sgx_record_epc_page(epc_page,
> +			    SGX_EPC_PAGE_UNRECLAIMABLE);

There are a bunch of these, apparently.

>  
>  	return epc_page;
>  }
> @@ -1315,7 +1321,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
>  {
>  	int ret;
>  
> -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
>  
>  	ret = __eremove(sgx_get_epc_virt_addr(page));
>  	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index a75eb44022a3..9a32bf5a1070 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  	encl->attributes = secs->attributes;
>  	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
>  
> +	sgx_record_epc_page(encl->secs.epc_page,
> +			    SGX_EPC_PAGE_UNRECLAIMABLE);
> +
>  	/* Set only after completion, as encl->lock has not been taken. */
>  	set_bit(SGX_ENCL_CREATED, &encl->flags);
>  
> @@ -322,7 +325,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>  			goto err_out;
>  	}
>  
> -	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
>  	return ret;
> @@ -976,7 +979,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
>  
>  			mutex_lock(&encl->lock);
>  
> -			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  		}
>  
>  		/* Change EPC type */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index dec1d57cbff6..b26860399402 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -318,7 +318,7 @@ static void sgx_reclaim_pages(void)
>  			/* The owner is freeing the page. No need to add the
>  			 * page back to the list of reclaimable pages.
>  			 */
> -			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +			sgx_epc_page_reset_state(epc_page);
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -344,6 +344,7 @@ static void sgx_reclaim_pages(void)
>  
>  skip:
>  		spin_lock(&sgx_global_lru.lock);
> +		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
>  		spin_unlock(&sgx_global_lru.lock);
>  
> @@ -367,7 +368,7 @@ static void sgx_reclaim_pages(void)
>  		sgx_reclaimer_write(epc_page, &backing[i]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		sgx_epc_page_reset_state(epc_page);
>  
>  		sgx_free_epc_page(epc_page);
>  	}
> @@ -507,9 +508,9 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
>  	page->flags |= flags;
> -	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> +	if (sgx_epc_page_reclaimable(flags))
>  		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
>  	spin_unlock(&sgx_global_lru.lock);
>  }
> @@ -527,7 +528,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> +	if (sgx_epc_page_reclaimable(page->flags)) {
>  		/* The page is being reclaimed. */
>  		if (list_empty(&page->list)) {
>  			spin_unlock(&sgx_global_lru.lock);
> @@ -535,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  		}
>  
>  		list_del(&page->list);
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		sgx_epc_page_reset_state(page);
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -607,6 +608,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>  	struct sgx_numa_node *node = section->node;
>  
> +	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
> +
>  	spin_lock(&node->lock);
>  
>  	page->owner = NULL;
> @@ -614,7 +617,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  		list_add(&page->list, &node->sgx_poison_page_list);
>  	else
>  		list_add_tail(&page->list, &node->free_page_list);
> -	page->flags = SGX_EPC_PAGE_IS_FREE;
> +	page->flags = SGX_EPC_PAGE_FREE;
>  
>  	spin_unlock(&node->lock);
>  	atomic_long_inc(&sgx_nr_free_pages);
> @@ -715,7 +718,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
>  	 * If the page is on a free list, move it to the per-node
>  	 * poison page list.
>  	 */
> -	if (page->flags & SGX_EPC_PAGE_IS_FREE) {
> +	if (page->flags == SGX_EPC_PAGE_FREE) {
>  		list_move(&page->list, &node->sgx_poison_page_list);
>  		goto out;
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 113d930fd087..2faeb40b345f 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -23,11 +23,36 @@
>  #define SGX_NR_LOW_PAGES		32
>  #define SGX_NR_HIGH_PAGES		64
>  
> -/* Pages, which are being tracked by the page reclaimer. */
> -#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
> +enum sgx_epc_page_state {
> +	/* Not tracked by the reclaimer:
> +	 * Pages allocated for virtual EPC which are never tracked by the host
> +	 * reclaimer; pages just allocated from free list but not yet put in
> +	 * use; pages just reclaimed, but not yet returned to the free list.
> +	 * Becomes FREE after sgx_free_epc()
> +	 * Becomes RECLAIMABLE or UNRECLAIMABLE after sgx_record_epc()
> +	 */
> +	SGX_EPC_PAGE_NOT_TRACKED = 0,
> +
> +	/* Page is in the free list, ready for allocation
> +	 * Becomes NOT_TRACKED after sgx_alloc_epc_page()
> +	 */
> +	SGX_EPC_PAGE_FREE = 1,
> +
> +	/* Page is in use and tracked in a reclaimable LRU list
> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 */
> +	SGX_EPC_PAGE_RECLAIMABLE = 2,
> +
> +	/* Page is in use but tracked in an unreclaimable LRU list. These are
> +	 * only reclaimable when the whole enclave is OOM killed or the enclave
> +	 * is released, e.g., VA, SECS pages
> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 */
> +	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>  
> -/* Pages on free list */
> -#define SGX_EPC_PAGE_IS_FREE		BIT(1)
> +};
> +
> +#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
>  
>  struct sgx_epc_page {
>  	unsigned int section;
> @@ -37,6 +62,22 @@ struct sgx_epc_page {
>  	struct list_head list;
>  };
>  
> +static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> +{
> +	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> +}
> +
> +static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
> +{
> +	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> +	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
> +static inline bool sgx_epc_page_reclaimable(unsigned long flags)
> +{
> +	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
>  /*
>   * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
>   * the free page list local to the node is stored here.
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  2023-09-23  3:06   ` Haitao Huang
@ 2023-09-25 17:13     ` Jarkko Sakkinen
  -1 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-25 17:13 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add RECLAIM_IN_PROGRESS state to not rely on list_empty(&epc_page->list)
> to determine if an EPC page is selected as a reclaiming candidate.
>
> When a page is being reclaimed from the page pool (sgx_global_lru),
> there is an intermediate stage where a page may have been identified as
> a candidate for reclaiming, but has not yet been reclaimed.  Currently
> such pages are list_del_init()'d from the global LRU list, and stored in
> an array on the stack. To prevent another thread from dropping the same
> page in the middle of reclaiming, sgx_drop_epc_page() checks for
> list_empty(&epc_page->list).
>
> A later patch will replace the array on stack with a temporary list to
> store the candidate pages, so list_empty() should no longer be used for
> this purpose.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V4:
> - Fixed some typos.
> - Revised commit message.
>
> V3:
> - Extend the sgx_epc_page_state enum introduced earlier to replace the
> flag based approach.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 21 ++++++++++-----------
>  arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++++++++++++++++
>  2 files changed, 26 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index b26860399402..c1ae19a154d0 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -312,13 +312,15 @@ static void sgx_reclaim_pages(void)
>  		list_del_init(&epc_page->list);
>  		encl_page = epc_page->owner;
>  
> -		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> +			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>  			chunk[cnt++] = epc_page;
> -		else
> +		} else {
>  			/* The owner is freeing the page. No need to add the
>  			 * page back to the list of reclaimable pages.
>  			 */
>  			sgx_epc_page_reset_state(epc_page);
> +		}
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -528,16 +530,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (sgx_epc_page_reclaimable(page->flags)) {
> -		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		list_del(&page->list);
> -		sgx_epc_page_reset_state(page);
> +	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
>  	}
> +
> +	list_del(&page->list);
> +	sgx_epc_page_reset_state(page);
>  	spin_unlock(&sgx_global_lru.lock);
>  
>  	return 0;
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 2faeb40b345f..764cec23f4e5 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -40,6 +40,8 @@ enum sgx_epc_page_state {
>  
>  	/* Page is in use and tracked in a reclaimable LRU list
>  	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
> +	 * for reclaiming
>  	 */
>  	SGX_EPC_PAGE_RECLAIMABLE = 2,
>  
> @@ -50,6 +52,14 @@ enum sgx_epc_page_state {
>  	 */
>  	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>  
> +	/* Page is being prepared for reclamation, tracked in a temporary
> +	 * isolated list by the reclaimer.
> +	 * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
> +	 * fails for any reason.
> +	 * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
> +	 * and immediately sgx_free_epc() is called to make it FREE.
> +	 */
> +	SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 4,
>  };
>  
>  #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
> @@ -73,6 +83,12 @@ static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned lo
>  	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
>  }
>  
> +static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
> +{
> +	return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags &
> +						    SGX_EPC_PAGE_STATE_MASK);
> +}

	return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags & SGX_EPC_PAGE_STATE_MASK);

> +
>  static inline bool sgx_epc_page_reclaimable(unsigned long flags)
>  {
>  	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
> -- 
> 2.25.1


BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
@ 2023-09-25 17:15     ` Jarkko Sakkinen
  0 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-25 17:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V5:
> - kernel-doc fixes (Jarkko)
>
> V4:
> - Fix a white space issue in Kconfig (Randy).
> - Update comments for LRU list as it can be owned by a cgroup.
> - Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently (Mikko)
>
> V3:
>
> 1) Use the same maximum number of reclaiming candidate pages to be
> processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
> cgroup worker function and ksgxd. This fixes an overflow in the
> backing store buffer with the same fixed size allocated on stack in
> sgx_reclaim_epc_pages().
>
> 2) Initialize max for root EPC cgroup. Otherwise, all
> misc_cg_try_charge() calls would fail as it checks the limits of all
> ancestors up to the root node.
>
> 3) Start reclaiming whenever misc_cg_try_charge fails. Removed all
> re-checks for limits and current usage. For all intents and purposes,
> when misc_try_charge() fails, reclaiming is needed. This also corrects
> an error of not reclaiming when the child limit is larger than one of
> its ancestors.
>
> 4) Handle failure on charging to the root EPC cgroup. Failure on charging
> to root means we are at or above capacity, so start reclaiming or return
> OOM error.
>
> 5) Removed the custom cgroup tree walking iterator with epoch tracking
> logic. Replaced it with just the plain css_for_each_descendant_pre
> iterator. The custom iterator implemented a rather complex epoch scheme
> I believe was intended to prevent extra reclaiming from multiple worker
> threads doing the same walk, but it turned out not to matter much as each
> thread would only reclaim when usage is above limit. Using the plain
> css_for_each_descendant_pre iterator simplified code a bit.
>
> 6) Do not reclaim synchronously in misc_max_write callback which would
> block the user. Instead queue an async work item to run the reclaiming
> loop.
>
> 7) Other minor refactoring:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
> ---
>  arch/x86/Kconfig                     |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
>  arch/x86/kernel/cpu/sgx/main.c       |  68 ++++-
>  arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +	  Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>  	bool "X86 userspace shadow stack"
>  	depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>  	ioctl.o \
>  	main.o
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..b5da89cf3a4c
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,415 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include <linux/ratelimit.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +#include <linux/threads.h>
> +
> +#include "epc_cgroup.h"
> +
> +#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
> +#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
> +#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
> +
> +static struct workqueue_struct *sgx_epc_cg_wq;
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
> +
> +struct sgx_epc_reclaim_control {
> +	struct sgx_epc_cgroup *epc_cg;
> +	int nr_fails;
> +	bool ignore_age;
> +};
> +
> +static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
> +}
> +
> +static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
> +}
> +
> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> +	struct misc_cg *i = epc_cg->cg;
> +	u64 m = U64_MAX;
> +
> +	while (i) {
> +		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> +		i = misc_cg_parent(i);
> +	}
> +
> +	return m / PAGE_SIZE;
> +}
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	if (cg)
> +		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +
> +	return NULL;
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> +	return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its lrus
> + * @root:	root of the tree to check
> + *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensures css_root ref is acquired
> +	 */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_isolate_pages() - walk a cgroup tree and separate pages
> + * @root:	root of the tree to start walking
> + * @nr_to_scan: The number of pages that need to be isolated
> + * @dst:	Destination list to hold the isolated pages
> + *
> + * Walk the cgroup tree and isolate the pages in the hierarchy
> + * for reclaiming.
> + */
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	if (!*nr_to_scan)
> +		return;
> +
> +	 /* Caller ensures css_root ref is acquired */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!*nr_to_scan)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
> +					struct sgx_epc_reclaim_control *rc)
> +{
> +	/*
> +	 * Ensure sgx_reclaim_pages is called with a minimum and maximum
> +	 * number of pages.  Attempting to reclaim only a few pages will
> +	 * often fail and is inefficient, while reclaiming a huge number
> +	 * of pages can result in soft lockups due to holding various
> +	 * locks for an extended duration.
> +	 */
> +	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
> +	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
> +
> +	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
> +{
> +	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
> +		return -ENOMEM;
> +
> +	++rc->nr_fails;
> +	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
> +		rc->ignore_age = true;
> +
> +	return 0;
> +}
> +
> +static inline
> +void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
> +				  struct sgx_epc_cgroup *epc_cg)
> +{
> +	rc->epc_cg = epc_cg;
> +	rc->nr_fails = 0;
> +	rc->ignore_age = false;
> +}
> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
> + * cgroup when the cgroup is at/near its maximum capacity
> + */
> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +	u64 cur, max;
> +
> +	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> +		/*
> +		 * Adjust the limit down by one page, the goal is to free up
> +		 * pages for fault allocations, not to simply obey the limit.
> +		 * Conditionally decrementing max also means the cur vs. max
> +		 * check will correctly handle the case where both are zero.
> +		 */
> +		if (max)
> +			max--;
> +
> +		/*
> +		 * Unless the limit is extremely low, in which case forcing
> +		 * reclaim will likely cause thrashing, force the cgroup to
> +		 * reclaim at least once if it's operating *near* its maximum
> +		 * limit by adjusting @max down by half the min reclaim size.
> +		 * This work func is scheduled by sgx_epc_cgroup_try_charge
> +		 * when it cannot directly reclaim due to being in an atomic
> +		 * context, e.g. EPC allocation in a fault handler.  Waiting
> +		 * to reclaim until the cgroup is actually at its limit is less
> +		 * performant as it means the faulting task is effectively
> +		 * blocked until a worker makes its way through the global work
> +		 * queue.
> +		 */
> +		if (max > SGX_NR_TO_SCAN_MAX)
> +			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
> +
> +		max = min(max, sgx_epc_total_pages);
> +		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +		if (cur <= max)
> +			break;
> +		/* Nothing reclaimable */
> +		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
> +			if (!sgx_epc_cgroup_oom(epc_cg))
> +				break;
> +
> +			continue;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc))
> +				break;
> +		}
> +	}
> +}
> +
> +static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
> +				       bool reclaim)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	unsigned int nr_empty = 0;
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
> +				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
> +					return -ENOMEM;
> +				schedule();
> +			}
> +		}
> +	}
> +	if (epc_cg->cg != misc_cg_root())
> +		css_get(&epc_cg->cg->css);
> +
> +	return 0;
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + * @mm:			the mm_struct of the process to charge
> + * @reclaim:		whether or not synchronous reclaim is allowed
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
> +	put_misc_cg(epc_cg->cg);
> +
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;
> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (epc_cg->cg != misc_cg_root())
> +		put_misc_cg(epc_cg->cg);
> +}
> +
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	bool oom = false;
> +
> +	 /* Caller ensures css_root ref is acquired */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		/* skip dead ones */
> +		if (!css_tryget(pos))
> +			continue;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		oom = sgx_epc_oom(&epc_cg->lru);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (oom)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return oom;
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +	cancel_work_sync(&epc_cg->reclaim_work);
> +	kfree(epc_cg);
> +}
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +	/* Let the reclaimer do the work so the user is not blocked */
> +	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> +	if (!epc_cg)
> +		return -ENOMEM;
> +
> +	sgx_lru_init(&epc_cg->lru);
> +	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
> +	cg->res[MISC_CG_RES_SGX_EPC].alloc = sgx_epc_cgroup_alloc;
> +	cg->res[MISC_CG_RES_SGX_EPC].free = sgx_epc_cgroup_free;
> +	cg->res[MISC_CG_RES_SGX_EPC].max_write = sgx_epc_cgroup_max_write;

It would be better to have ops structure and then in SGX code const
struct defining the ops.
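
E.g. (sketch only; the struct shape mirrors the three callbacks added in
patch 01, and the SGX-side names are the functions from this patch):

	struct misc_res_ops {
		int (*alloc)(struct misc_cg *cg);
		void (*free)(struct misc_cg *cg);
		void (*max_write)(struct misc_cg *cg);
	};

and then on the SGX side a single const instance:

	static const struct misc_res_ops sgx_epc_cgroup_ops = {
		.alloc		= sgx_epc_cgroup_alloc,
		.free		= sgx_epc_cgroup_free,
		.max_write	= sgx_epc_cgroup_max_write,
	};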

> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +
> +	return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> +	struct misc_cg *cg;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return 0;
> +
> +	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!sgx_epc_cg_wq);
> +
> +	cg = misc_cg_root();
> +	BUG_ON(!cg);
> +
> +	return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..dfc902f4d96f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +
> +static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +						size_t *nr_to_scan,
> +						struct list_head *dst) { }
> +
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return NULL;
> +}
> +
> +static bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	return true;
> +}
> +#else
> +struct sgx_epc_cgroup {
> +	struct misc_cg *cg;
> +	struct sgx_epc_lru_lists	lru;
> +	struct work_struct	reclaim_work;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst);
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (epc_cg)
> +		return &epc_cg->lru;
> +	return NULL;
> +}
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d37ef0dd865f..0ade7792ff5f 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/pagemap.h>
>  #include <linux/ratelimit.h>
> @@ -17,12 +18,9 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> +#include "epc_cgroup.h"
>  
> -/*
> - * Maximum number of pages to scan for reclaiming.
> - */
> -#define SGX_NR_TO_SCAN_MAX	32
> -
> +u64 sgx_epc_total_pages;
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
> @@ -37,11 +35,17 @@ static struct sgx_epc_lru_lists sgx_global_lru;
>  
>  static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return epc_cg_lru(epc_page->epc_cg);
> +
>  	return &sgx_global_lru;
>  }
>  
>  static inline bool sgx_can_reclaim(void)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return !sgx_epc_cgroup_lru_empty(NULL);
> +
>  	return !list_empty(&sgx_global_lru.reclaimable);
>  }
>  
> @@ -300,14 +304,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * @nr_to_scan:	Number of pages to scan for reclaim
>   * @dst:	Destination list to hold the isolated pages
>   */
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
>  			   struct list_head *dst)
>  {
>  	struct sgx_encl_page *encl_page;
>  	struct sgx_epc_page *epc_page;
>  
>  	spin_lock(&lru->lock);
> -	for (; nr_to_scan > 0; --nr_to_scan) {
> +	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
>  		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
>  		if (!epc_page)
>  			break;
> @@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
> + * @epc_cg:		 EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
> @@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg)
>  {
>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
> @@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	LIST_HEAD(iso);
>  	size_t ret, i;
>  
> -	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
> +	/*
> +	 * If a specific cgroup is not being targeted, take from the global
> +	 * list first, even when cgroups are enabled.  If there are
> +	 * pages on the global LRU then they should get reclaimed asap.
> +	 */
> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> +		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
>  
>  	if (list_empty(&iso))
>  		return 0;
> @@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  }
>  
>  static int ksgxd(void *p)
> @@ -446,7 +460,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  
>  		cond_resched();
>  	}
> @@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>  
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (!sgx_can_reclaim())
> -			return ERR_PTR(-ENOMEM);
> +		if (!sgx_can_reclaim()) {
> +			page = ERR_PTR(-ENOMEM);
> +			break;
> +		}
>  
>  		if (!reclaim) {
>  			page = ERR_PTR(-EBUSY);
> @@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  		cond_resched();
>  	}
>  
> +	if (!IS_ERR(page)) {
> +		WARN_ON_ONCE(page->epc_cg);
> +		page->epc_cg = epc_cg;
> +	} else {
> +		sgx_epc_cgroup_uncharge(epc_cg);
> +	}
> +
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>  		wake_up(&ksgxd_waitq);
>  
> @@ -647,6 +675,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
>  
> +	if (page->epc_cg) {
> +		sgx_epc_cgroup_uncharge(page->epc_cg);
> +		page->epc_cg = NULL;
> +	}
> +
>  	spin_lock(&node->lock);
>  
>  	page->encl_page = NULL;
> @@ -657,6 +690,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	page->flags = SGX_EPC_PAGE_FREE;
>  
>  	spin_unlock(&node->lock);
> +
>  	atomic_long_inc(&sgx_nr_free_pages);
>  }
>  
> @@ -826,6 +860,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		section->pages[i].flags = 0;
>  		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
> +		section->pages[i].epc_cg = NULL;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> @@ -970,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -1020,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
>  
>  		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>  
>  		sgx_nr_epc_sections++;
>  	}
> @@ -1029,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>  
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
> +
>  	return true;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 42075762084c..1b90a905a9e2 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -19,6 +19,11 @@
>  
>  #define SGX_MAX_EPC_SECTIONS		8
>  #define SGX_EEXTEND_BLOCK_SIZE		256
> +
> +/*
> + * Maximum number of pages to scan for reclaiming.
> + */
> +#define SGX_NR_TO_SCAN_MAX		32UL
>  #define SGX_NR_TO_SCAN			16
>  #define SGX_NR_LOW_PAGES		32
>  #define SGX_NR_HIGH_PAGES		64
> @@ -70,6 +75,8 @@ enum sgx_epc_page_state {
>  /* flag for pages owned by a sgx_encl struct */
>  #define SGX_EPC_OWNER_ENCL		BIT(4)
>  
> +struct sgx_epc_cgroup;
> +
>  struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
> @@ -81,6 +88,7 @@ struct sgx_epc_page {
>  		struct sgx_encl *encl;
>  	};
>  	struct list_head list;
> +	struct sgx_epc_cgroup *epc_cg;
>  };
>  
>  static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> @@ -129,6 +137,7 @@ struct sgx_epc_section {
>  	struct sgx_numa_node *node;
>  };
>  
> +extern u64 sgx_epc_total_pages;
>  extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  
>  static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
> @@ -152,7 +161,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  }
>  
>  /*
> - * Contains EPC pages tracked by the reclaimer (ksgxd).
> + * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
> + * cgroup.
>   */
>  struct sgx_epc_lru_lists {
>  	spinlock_t lock;
> @@ -179,8 +189,9 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>  bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg);
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
>  			   struct list_head *dst);
>  
>  void sgx_ipi_cb(void *info);
> -- 
> 2.25.1


BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
@ 2023-09-25 18:50     ` Tejun Heo
  0 siblings, 0 replies; 144+ messages in thread
From: Tejun Heo @ 2023-09-25 18:50 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen, yangjie

On Fri, Sep 22, 2023 at 08:06:41PM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
> for the misc controller.
> 
> Add per resource type private data so that SGX can store additional per
> cgroup data in misc_cg->misc_cg_res[MISC_CG_RES_SGX_EPC].
> 
> Export misc_cg_root() so the SGX driver can initialize and add those
> additional structures to the root misc cgroup as part of initialization
> for EPC cgroup support. This bootstraps the same additional
> initialization for non-root cgroups in the 'alloc()' callback added in the
> previous patch.
> 
> The SGX driver, as the EPC memory provider, will have a background
> worker to reclaim EPC pages to make room for new allocations in the same
> cgroup when its usage counter reaches near the limit controlled by the
> cgroup and its ancestors. Therefore it needs to do a walk from the
> current cgroup up to the root. To enable this walk, move parent_misc()
> into misc_cgroup.h and make inline to make this function available to
> SGX, rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to
> use the new name.
> 
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-25 17:09     ` Jarkko Sakkinen
  (?)
@ 2023-09-26  3:04     ` Haitao Huang
  2023-09-26 13:10         ` Jarkko Sakkinen
  -1 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-09-26  3:04 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, Jarkko Sakkinen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

Hi Jarkko

On Mon, 25 Sep 2023 12:09:21 -0500, Jarkko Sakkinen <jarkko@kernel.org> wrote:

> On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> The misc cgroup controller (subsystem) currently does not perform
>> resource type specific action for Cgroups Subsystem State (CSS) events:
>> the 'css_alloc' event when a cgroup is created and the 'css_free' event
>> when a cgroup is destroyed, or in event of user writing the max value to
>> the misc.max file to set the usage limit of a specific resource
>> [admin-guide/cgroup-v2.rst, 5-9. Misc].
>>
>> Define callbacks for those events and allow resource providers to
>> register the callbacks per resource type as needed. This will be
>> utilized later by the EPC misc cgroup support implemented in the SGX
>> driver:
>> - On css_alloc, allocate and initialize necessary structures for EPC
>> reclaiming, e.g., LRU list, work queue, etc.
>> - On css_free, cleanup and free those structures created in alloc.
>> - On max_write, trigger EPC reclaiming if the new limit is at or below
>> current usage.
>>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> ---
>> V5:
>> - Remove prefixes from the callback names (tj)
>> - Update commit message (Jarkko)
>>
>> V4:
>> - Moved this to the front of the series.
>> - Applies on cgroup/for-6.6 with the overflow fix for misc.
>>
>> V3:
>> - Removed the released() callback
>> ---
>>  include/linux/misc_cgroup.h |  5 +++++
>>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
>>  2 files changed, 34 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
>> index e799b1f8d05b..96a88822815a 100644
>> --- a/include/linux/misc_cgroup.h
>> +++ b/include/linux/misc_cgroup.h
>> @@ -37,6 +37,11 @@ struct misc_res {
>>  	u64 max;
>>  	atomic64_t usage;
>>  	atomic64_t events;
>> +
>> +	/* per resource callback ops */
>> +	int (*alloc)(struct misc_cg *cg);
>> +	void (*free)(struct misc_cg *cg);
>> +	void (*max_write)(struct misc_cg *cg);
>>  };
>>
>>  /**
>> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
>> index 79a3717a5803..62c9198dee21 100644
>> --- a/kernel/cgroup/misc.c
>> +++ b/kernel/cgroup/misc.c
>> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>>
>>  	cg = css_misc(of_css(of));
>>
>> -	if (READ_ONCE(misc_res_capacity[type]))
>> +	if (READ_ONCE(misc_res_capacity[type])) {
>>  		WRITE_ONCE(cg->res[type].max, max);
>> -	else
>> +		if (cg->res[type].max_write)
>> +			cg->res[type].max_write(cg);
>> +	} else {
>>  		ret = -EINVAL;
>> +	}
>>
>>  	return ret ? ret : nbytes;
>>  }
>> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
>>  static struct cgroup_subsys_state *
>>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>>  {
>> +	struct misc_cg *parent_cg;
>>  	enum misc_res_type i;
>>  	struct misc_cg *cg;
>> +	int ret;
>>
>>  	if (!parent_css) {
>>  		cg = &root_cg;
>> +		parent_cg = &root_cg;
>>  	} else {
>>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>>  		if (!cg)
>>  			return ERR_PTR(-ENOMEM);
>> +		parent_cg = css_misc(parent_css);
>>  	}
>>
>>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
>>  		atomic64_set(&cg->res[i].usage, 0);
>> +		if (parent_cg->res[i].alloc) {
>> +			ret = parent_cg->res[i].alloc(cg);
>> +			if (ret)
>> +				goto alloc_err;
>> +		}
>>  	}
>>
>>  	return &cg->css;
>> +
>> +alloc_err:
>> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
>> +		if (parent_cg->res[i].free)
>> +			cg->res[i].free(cg);
>> +	kfree(cg);
>> +	return ERR_PTR(ret);
>>  }
>>
>>  /**
>> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>>   */
>>  static void misc_cg_free(struct cgroup_subsys_state *css)
>>  {
>> -	kfree(css_misc(css));
>> +	struct misc_cg *cg = css_misc(css);
>> +	enum misc_res_type i;
>> +
>> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
>> +		if (cg->res[i].free)
>> +			cg->res[i].free(cg);
>> +
>> +	kfree(cg);
>>  }
>>
>>  /* Cgroup controller callbacks */
>> --
>> 2.25.1
>
> Since the only existing client feature requires all callbacks, should
> this not have that as an invariant?
>
> I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
> catch issues in the kernel code).
>

These callbacks are chained from cgroup_subsys, and they are defined
separately there, so it'd be better to follow the same pattern. Or change
them together with cgroup_subsys if we want to do that. Reasonable?
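
For reference, the misc controller's own cgroup_subsys instance assigns
its callbacks individually in just the same way (roughly, from memory):

struct cgroup_subsys misc_cgrp_subsys = {
	.css_alloc = misc_cg_alloc,
	.css_free = misc_cg_free,
	.legacy_cftypes = misc_cg_files,
	.dfl_cftypes = misc_cg_files,
};

so per-resource callbacks assigned field by field keep the two levels
symmetric.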

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
@ 2023-09-26 13:10         ` Jarkko Sakkinen
  0 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-26 13:10 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Tue Sep 26, 2023 at 6:04 AM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Mon, 25 Sep 2023 12:09:21 -0500, Jarkko Sakkinen <jarkko@kernel.org> wrote:
>
> > On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> >>
> >> The misc cgroup controller (subsystem) currently does not perform
> >> resource type specific action for Cgroups Subsystem State (CSS) events:
> >> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> >> when a cgroup is destroyed, or in event of user writing the max value to
> >> the misc.max file to set the usage limit of a specific resource
> >> [admin-guide/cgroup-v2.rst, 5-9. Misc].
> >>
> >> Define callbacks for those events and allow resource providers to
> >> register the callbacks per resource type as needed. This will be
> >> utilized later by the EPC misc cgroup support implemented in the SGX
> >> driver:
> >> - On css_alloc, allocate and initialize necessary structures for EPC
> >> reclaiming, e.g., LRU list, work queue, etc.
> >> - On css_free, cleanup and free those structures created in alloc.
> >> - On max_write, trigger EPC reclaiming if the new limit is at or below
> >> current usage.
> >>
> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> ---
> >> V5:
> >> - Remove prefixes from the callback names (tj)
> >> - Update commit message (Jarkko)
> >>
> >> V4:
> >> - Moved this to the front of the series.
> >> - Applies on cgroup/for-6.6 with the overflow fix for misc.
> >>
> >> V3:
> >> - Removed the released() callback
> >> ---
> >>  include/linux/misc_cgroup.h |  5 +++++
> >>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
> >>  2 files changed, 34 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> >> index e799b1f8d05b..96a88822815a 100644
> >> --- a/include/linux/misc_cgroup.h
> >> +++ b/include/linux/misc_cgroup.h
> >> @@ -37,6 +37,11 @@ struct misc_res {
> >>  	u64 max;
> >>  	atomic64_t usage;
> >>  	atomic64_t events;
> >> +
> >> +	/* per resource callback ops */
> >> +	int (*alloc)(struct misc_cg *cg);
> >> +	void (*free)(struct misc_cg *cg);
> >> +	void (*max_write)(struct misc_cg *cg);
> >>  };
> >>
> >>  /**
> >> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> >> index 79a3717a5803..62c9198dee21 100644
> >> --- a/kernel/cgroup/misc.c
> >> +++ b/kernel/cgroup/misc.c
> >> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
> >>
> >>  	cg = css_misc(of_css(of));
> >>
> >> -	if (READ_ONCE(misc_res_capacity[type]))
> >> +	if (READ_ONCE(misc_res_capacity[type])) {
> >>  		WRITE_ONCE(cg->res[type].max, max);
> >> -	else
> >> +		if (cg->res[type].max_write)
> >> +			cg->res[type].max_write(cg);
> >> +	} else {
> >>  		ret = -EINVAL;
> >> +	}
> >>
> >>  	return ret ? ret : nbytes;
> >>  }
> >> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
> >>  static struct cgroup_subsys_state *
> >>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
> >>  {
> >> +	struct misc_cg *parent_cg;
> >>  	enum misc_res_type i;
> >>  	struct misc_cg *cg;
> >> +	int ret;
> >>
> >>  	if (!parent_css) {
> >>  		cg = &root_cg;
> >> +		parent_cg = &root_cg;
> >>  	} else {
> >>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
> >>  		if (!cg)
> >>  			return ERR_PTR(-ENOMEM);
> >> +		parent_cg = css_misc(parent_css);
> >>  	}
> >>
> >>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> >>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
> >>  		atomic64_set(&cg->res[i].usage, 0);
> >> +		if (parent_cg->res[i].alloc) {
> >> +			ret = parent_cg->res[i].alloc(cg);
> >> +			if (ret)
> >> +				goto alloc_err;
> >> +		}
> >>  	}
> >>
> >>  	return &cg->css;
> >> +
> >> +alloc_err:
> >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> >> +		if (parent_cg->res[i].free)
> >> +			cg->res[i].free(cg);
> >> +	kfree(cg);
> >> +	return ERR_PTR(ret);
> >>  }
> >>
> >>  /**
> >> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
> >>   */
> >>  static void misc_cg_free(struct cgroup_subsys_state *css)
> >>  {
> >> -	kfree(css_misc(css));
> >> +	struct misc_cg *cg = css_misc(css);
> >> +	enum misc_res_type i;
> >> +
> >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> >> +		if (cg->res[i].free)
> >> +			cg->res[i].free(cg);
> >> +
> >> +	kfree(cg);
> >>  }
> >>
> >>  /* Cgroup controller callbacks */
> >> --
> >> 2.25.1
> >
> > Since the only existing client feature requires all callbacks, should
> > this not have that as an invariant?
> >
> > I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
> > catch issues in the kernel code).
> >
>
> These callbacks are chained from cgroup_subsys, and they are defined  
> separately so it'd be better follow the same pattern.  Or change together  
> with cgroup_subsys if we want to do that. Reasonable?

I noticed this one later:

It would be better to create a separate ops struct and declare the
instance as const, at minimum.

Then there is no need for dynamic assignment of ops, and all of that
lives in rodata. This improves security and also allows somewhat better
static analysis.

As it stands, you have to dynamically trace the struct instance, e.g. in
case of a bug. If this were done, it would already be there in the
vmlinux.
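
E.g. something along these lines (hypothetical names, just to sketch the
shape, not an existing API):

struct misc_res_ops {
	int (*alloc)(struct misc_cg *cg);
	void (*free)(struct misc_cg *cg);
	void (*max_write)(struct misc_cg *cg);
};

static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];

int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
{
	if (type >= MISC_CG_RES_TYPES)
		return -EINVAL;

	/* Enforce the invariant: all callbacks set, or none at all. */
	if (!ops->alloc || !ops->free || !ops->max_write)
		return -EINVAL;

	misc_res_ops[type] = ops;
	return 0;
}

Then the SGX side registers one const instance and nothing gets patched
at runtime.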

BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-26 13:10         ` Jarkko Sakkinen
@ 2023-09-26 13:13           ` Jarkko Sakkinen
  -1 siblings, 0 replies; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-09-26 13:13 UTC (permalink / raw)
  To: Jarkko Sakkinen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, x86, cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Tue Sep 26, 2023 at 4:10 PM EEST, Jarkko Sakkinen wrote:
> On Tue Sep 26, 2023 at 6:04 AM EEST, Haitao Huang wrote:
> > Hi Jarkko
> >
> > On Mon, 25 Sep 2023 12:09:21 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> > wrote:
> >
> > > On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> > >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > >>
> > >> The misc cgroup controller (subsystem) currently does not perform
> > >> resource type specific action for Cgroups Subsystem State (CSS) events:
> > >> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> > >> when a cgroup is destroyed, or in event of user writing the max value to
> > >> the misc.max file to set the usage limit of a specific resource
> > >> [admin-guide/cgroup-v2.rst, 5-9. Misc].
> > >>
> > >> Define callbacks for those events and allow resource providers to
> > >> register the callbacks per resource type as needed. This will be
> > >> utilized later by the EPC misc cgroup support implemented in the SGX
> > >> driver:
> > >> - On css_alloc, allocate and initialize necessary structures for EPC
> > >> reclaiming, e.g., LRU list, work queue, etc.
> > >> - On css_free, cleanup and free those structures created in alloc.
> > >> - On max_write, trigger EPC reclaiming if the new limit is at or below
> > >> current usage.
> > >>
> > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > >> ---
> > >> V5:
> > >> - Remove prefixes from the callback names (tj)
> > >> - Update commit message (Jarkko)
> > >>
> > >> V4:
> > >> - Moved this to the front of the series.
> > >> - Applies on cgroup/for-6.6 with the overflow fix for misc.
> > >>
> > >> V3:
> > >> - Removed the released() callback
> > >> ---
> > >>  include/linux/misc_cgroup.h |  5 +++++
> > >>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
> > >>  2 files changed, 34 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> > >> index e799b1f8d05b..96a88822815a 100644
> > >> --- a/include/linux/misc_cgroup.h
> > >> +++ b/include/linux/misc_cgroup.h
> > >> @@ -37,6 +37,11 @@ struct misc_res {
> > >>  	u64 max;
> > >>  	atomic64_t usage;
> > >>  	atomic64_t events;
> > >> +
> > >> +	/* per resource callback ops */
> > >> +	int (*alloc)(struct misc_cg *cg);
> > >> +	void (*free)(struct misc_cg *cg);
> > >> +	void (*max_write)(struct misc_cg *cg);
> > >>  };
> > >>
> > >>  /**
> > >> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> > >> index 79a3717a5803..62c9198dee21 100644
> > >> --- a/kernel/cgroup/misc.c
> > >> +++ b/kernel/cgroup/misc.c
> > >> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct  
> > >> kernfs_open_file *of, char *buf,
> > >>
> > >>  	cg = css_misc(of_css(of));
> > >>
> > >> -	if (READ_ONCE(misc_res_capacity[type]))
> > >> +	if (READ_ONCE(misc_res_capacity[type])) {
> > >>  		WRITE_ONCE(cg->res[type].max, max);
> > >> -	else
> > >> +		if (cg->res[type].max_write)
> > >> +			cg->res[type].max_write(cg);
> > >> +	} else {
> > >>  		ret = -EINVAL;
> > >> +	}
> > >>
> > >>  	return ret ? ret : nbytes;
> > >>  }
> > >> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
> > >>  static struct cgroup_subsys_state *
> > >>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
> > >>  {
> > >> +	struct misc_cg *parent_cg;
> > >>  	enum misc_res_type i;
> > >>  	struct misc_cg *cg;
> > >> +	int ret;
> > >>
> > >>  	if (!parent_css) {
> > >>  		cg = &root_cg;
> > >> +		parent_cg = &root_cg;
> > >>  	} else {
> > >>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
> > >>  		if (!cg)
> > >>  			return ERR_PTR(-ENOMEM);
> > >> +		parent_cg = css_misc(parent_css);
> > >>  	}
> > >>
> > >>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> > >>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
> > >>  		atomic64_set(&cg->res[i].usage, 0);
> > >> +		if (parent_cg->res[i].alloc) {
> > >> +			ret = parent_cg->res[i].alloc(cg);
> > >> +			if (ret)
> > >> +				goto alloc_err;
> > >> +		}
> > >>  	}
> > >>
> > >>  	return &cg->css;
> > >> +
> > >> +alloc_err:
> > >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> > >> +		if (parent_cg->res[i].free)
> > >> +			cg->res[i].free(cg);
> > >> +	kfree(cg);
> > >> +	return ERR_PTR(ret);
> > >>  }
> > >>
> > >>  /**
> > >> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state  
> > >> *parent_css)
> > >>   */
> > >>  static void misc_cg_free(struct cgroup_subsys_state *css)
> > >>  {
> > >> -	kfree(css_misc(css));
> > >> +	struct misc_cg *cg = css_misc(css);
> > >> +	enum misc_res_type i;
> > >> +
> > >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> > >> +		if (cg->res[i].free)
> > >> +			cg->res[i].free(cg);
> > >> +
> > >> +	kfree(cg);
> > >>  }
> > >>
> > >>  /* Cgroup controller callbacks */
> > >> --
> > >> 2.25.1
> > >
> > > Since the only existing client feature requires all callbacks, should
> > > this not have that as an invariant?
> > >
> > > I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
> > > catch issues in the kernel code).
> > >
> >
> > These callbacks are chained from cgroup_subsys, and they are defined
> > separately, so it'd be better to follow the same pattern.  Or change them
> > together with cgroup_subsys if we want to do that. Reasonable?
>
> I noticed this one later:
>
> It would be better to create a separate ops struct and declare the
> instance as const, at minimum.
>
> Then there is no need for dynamic assignment of ops, and all of that is
> in rodata. This improves security and also helps static analysis a bit.
>
> Right now you have to dynamically trace the struct instance, e.g. in
> case of a bug. If this were done, it would already be in the vmlinux.

I.e., then in the driver you can have a static const struct declaration
with *all* pointers pre-assigned.
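
E.g., something like this in the driver (the callback names here are
placeholders, not code from the series):

	static const struct misc_res_ops sgx_epc_cgroup_ops = {
		.alloc     = sgx_epc_cgroup_alloc,
		.free      = sgx_epc_cgroup_free,
		.max_write = sgx_epc_cgroup_max_write,
	};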

Not sure if cgroups follows this or not, but it is *objectively*
better. Previous work is not always the best possible work...

BR, Jarkko

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-26 13:13           ` Jarkko Sakkinen
  (?)
@ 2023-09-27  1:56           ` Haitao Huang
  2023-10-02 22:47             ` Jarkko Sakkinen
  -1 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-09-27  1:56 UTC (permalink / raw)
  To: Jarkko Sakkinen, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Tue, 26 Sep 2023 08:13:18 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

...
>> > >>  /**
>> > >> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state
>> > >> *parent_css)
>> > >>   */
>> > >>  static void misc_cg_free(struct cgroup_subsys_state *css)
>> > >>  {
>> > >> -	kfree(css_misc(css));
>> > >> +	struct misc_cg *cg = css_misc(css);
>> > >> +	enum misc_res_type i;
>> > >> +
>> > >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
>> > >> +		if (cg->res[i].free)
>> > >> +			cg->res[i].free(cg);
>> > >> +
>> > >> +	kfree(cg);
>> > >>  }
>> > >>
>> > >>  /* Cgroup controller callbacks */
>> > >> --
>> > >> 2.25.1
>> > >
>> > > Since the only existing client feature requires all callbacks, should
>> > > this not have that as an invariant?
>> > >
>> > > I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
>> > > catch issues in the kernel code).
>> > >
>> >
>> > These callbacks are chained from cgroup_subsys, and they are defined
>> > separately, so it'd be better to follow the same pattern.  Or change them
>> > together with cgroup_subsys if we want to do that. Reasonable?
>>
>> I noticed this one later:
>>
>> It would be better to create a separate ops struct and declare the
>> instance as const, at minimum.
>>
>> Then there is no need for dynamic assignment of ops, and all of that is
>> in rodata. This improves security and also helps static analysis a bit.
>>
>> Right now you have to dynamically trace the struct instance, e.g. in
>> case of a bug. If this were done, it would already be in the vmlinux.
> I.e., then in the driver you can have a static const struct declaration
> with *all* pointers pre-assigned.
>
> Not sure if cgroups follows this or not, but it is *objectively*
> better. Previous work is not always the best possible work...
>

IIUC, this is like the vm_ops field in vma structs: although the function
pointers in vm_ops are assigned statically, you still need to dynamically
assign vm_ops for each vma instance.
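
E.g., roughly how a driver does it with vm_ops (illustrative names only):

	static const struct vm_operations_struct foo_vm_ops = {
		.fault = foo_fault,
	};

	/* in the driver's mmap handler */
	vma->vm_ops = &foo_vm_ops;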

So the code will look like this:

	if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc) {
		...
	}

I don't see this pattern used in cgroups, and I have no strong opinion
either way.

TJ, do you have preference on this?

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-27  9:20   ` Huang, Kai
  2023-10-03 14:29     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27  9:20 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed, or in event of user writing the max value to
> the misc.max file to set the usage limit of a specific resource
> [admin-guide/cgroup-v2.rst, 5-9. Misc].
> 
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver:
> - On css_alloc, allocate and initialize necessary structures for EPC
> reclaiming, e.g., LRU list, work queue, etc.
> - On css_free, cleanup and free those structures created in alloc.
> - On max_write, trigger EPC reclaiming if the new limit is at or below
> current usage.

Nit:

Wondering why we should trigger EPC reclaiming if the new limit is *at* current
usage?

I actually don't quite care about the why here, but writing these details in
the changelog may bring unnecessary confusion.  I guess you can just remove
all the details about what the SGX driver needs to do on these callbacks.

> 
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V5:
> - Remove prefixes from the callback names (tj)
> - Update commit message (Jarkko)
> 
> V4:
> - Moved this to the front of the series.
> - Applies on cgroup/for-6.6 with the overflow fix for misc.
> 
> V3:
> - Removed the released() callback
> ---
>  include/linux/misc_cgroup.h |  5 +++++
>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
>  2 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e799b1f8d05b..96a88822815a 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -37,6 +37,11 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +
> +	/* per resource callback ops */

Nit:

This comment isn't quite useful IMHO.  And it seems you should just expand the
existing comment for the 'struct misc_res', which already covers the existing
members.

Or, as Jarkko suggested, maybe you can introduce another structure
'misc_res_ops' and document all these callbacks in more detail, just like
'struct misc_res'.

Anyway, it's the cgroup maintainer's call.

> +	int (*alloc)(struct misc_cg *cg);
> +	void (*free)(struct misc_cg *cg);
> +	void (*max_write)(struct misc_cg *cg);
>  };
>  
>  /**
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 79a3717a5803..62c9198dee21 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>  
>  	cg = css_misc(of_css(of));
>  
> -	if (READ_ONCE(misc_res_capacity[type]))
> +	if (READ_ONCE(misc_res_capacity[type])) {
>  		WRITE_ONCE(cg->res[type].max, max);
> -	else
> +		if (cg->res[type].max_write)
> +			cg->res[type].max_write(cg);
> +	} else {
>  		ret = -EINVAL;
> +	}
>  
>  	return ret ? ret : nbytes;
>  }
> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> +	struct misc_cg *parent_cg;

Nit: 

The below variable '*cg' can be moved here together with 'parent_cg'.
 
>  	enum misc_res_type i;
>  	struct misc_cg *cg;
> +	int ret;
>  
>  	if (!parent_css) {
>  		cg = &root_cg;
> +		parent_cg = &root_cg;

Nit:
		parent_cg = cg = &root_cg;
?

>  	} else {
>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>  		if (!cg)
>  			return ERR_PTR(-ENOMEM);
> +		parent_cg = css_misc(parent_css);
>  	}
>  
>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
>  		atomic64_set(&cg->res[i].usage, 0);
> +		if (parent_cg->res[i].alloc) {
> +			ret = parent_cg->res[i].alloc(cg);
> +			if (ret)
> +				goto alloc_err;
> +		}
>  	}
>  
>  	return &cg->css;
> +
> +alloc_err:
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (parent_cg->res[i].free)
> +			cg->res[i].free(cg);
> +	kfree(cg);
> +	return ERR_PTR(ret);
>  }
>  
>  /**
> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>   */
>  static void misc_cg_free(struct cgroup_subsys_state *css)
>  {
> -	kfree(css_misc(css));
> +	struct misc_cg *cg = css_misc(css);
> +	enum misc_res_type i;
> +
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (cg->res[i].free)
> +			cg->res[i].free(cg);
> +
> +	kfree(cg);
>  }
>  
>  /* Cgroup controller callbacks */


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-09-27 10:14   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 10:14 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -268,7 +268,6 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  			goto out;
>  
>  		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
> -
>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  

Nit: perhaps unintended change.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-27 10:28   ` Huang, Kai
  2023-10-03  4:49     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 10:28 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> Use the lower 3 bits in the flags field of sgx_epc_page struct to
> track EPC states in its life cycle and define an enum for possible
> states. More state(s) will be added later.

This patch does more than what the changelog claims to do.  AFAICT it does
below:

 1) Use the lower 3 bits to track EPC page status
 2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RERCLAIMABLE
 3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
 4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE

The changelog only says 1) IIUC.

If we really want to do all these in one patch, then the changelog should at
least mention the justification of all of them.

But I don't see why 3) and 4) need to be done here.  Instead, IMHO they should
be done in a separate patch, after the unreclaimable list is introduced (or
you need to bring that patch forward).


For instance, ...

[snip]

> +
> +	/* Page is in use but tracked in an unreclaimable LRU list. These are
> +	 * only reclaimable when the whole enclave is OOM killed or the enclave
> +	 * is released, e.g., VA, SECS pages
> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 */
> +	SGX_EPC_PAGE_UNRECLAIMABLE = 3,

... We don't even have the unreclaimable LRU list yet.  It's odd to have this
comment here.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-27 10:42   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 10:42 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> @@ -312,13 +312,15 @@ static void sgx_reclaim_pages(void)
>  		list_del_init(&epc_page->list);
>  		encl_page = epc_page->owner;
>  
> -		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> +			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>  			chunk[cnt++] = epc_page;
> -		else
> +		} else {
>  			/* The owner is freeing the page. No need to add the
>  			 * page back to the list of reclaimable pages.
>  			 */

Please use proper comment style:

For single line comment:

	/* ... */

For multiple lines comment:

	/*
	 * ...
	 */
	
>  			sgx_epc_page_reset_state(epc_page);
> +		}

Nit: unintended new {} around 'else'?

>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -528,16 +530,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (sgx_epc_page_reclaimable(page->flags)) {
> -		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		list_del(&page->list);
> -		sgx_epc_page_reset_state(page);
> +	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
>  	}
> +
> +	list_del(&page->list);
> +	sgx_epc_page_reset_state(page);
>  	spin_unlock(&sgx_global_lru.lock);
>  
>  	return 0;
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 2faeb40b345f..764cec23f4e5 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -40,6 +40,8 @@ enum sgx_epc_page_state {
>  
>  	/* Page is in use and tracked in a reclaimable LRU list
>  	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
> +	 * for reclaiming
>  	 */
>  	SGX_EPC_PAGE_RECLAIMABLE = 2,
>  
> @@ -50,6 +52,14 @@ enum sgx_epc_page_state {
>  	 */
>  	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>  
> +	/* Page is being prepared for reclamation, tracked in a temporary
> +	 * isolated list by the reclaimer.
> +	 * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
> +	 * fails for any reason.
> +	 * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
> +	 * and immediately sgx_free_epc() is called to make it FREE.
> +	 */

Ditto for the comment style.

And please use a blank line to separate paragraphs for better readability.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-09-27 11:14   ` Huang, Kai
  2023-09-27 15:35     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 11:14 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In a later patch, when a cgroup has exceeded the max capacity for EPC
> pages, it may need to identify and OOM kill a less active enclave to
> make room for other enclaves within the same group. Such a victim
> enclave would have no active pages other than the unreclaimable Version
> Array (VA) and SECS pages.  Therefore, the cgroup needs examine its
> unreclaimable page list, and finding an enclave given a SECS page or a
> VA page. This will require a backpointer from a page to an enclave,
> which is not available for VA pages.
> 
> Because struct sgx_epc_page instances of VA pages are not owned by an
> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
> which will store this value in the owner field of the struct
> sgx_epc_page.  In a later patch, VA pages will be placed in an
> unreclaimable queue that can be examined by the cgroup to select the OOM
> killed enclave.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> 

[...]

> @@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
>  		if (!IS_ERR(page)) {
> -			page->owner = owner;
> +			page->encl_page = owner;

Looks using 'encl_page' is arbitrary.

Also actually for virtual EPC page the owner is set to the 'sgx_vepc' instance.

>  			break;
>  		}
>  
> @@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	spin_lock(&node->lock);
>  
> -	page->owner = NULL;
> +	page->encl_page = NULL;

Ditto.

>  	if (page->poison)
>  		list_add(&page->list, &node->sgx_poison_page_list);
>  	else
> @@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
>  		section->pages[i].flags = 0;
> -		section->pages[i].owner = NULL;
> +		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 764cec23f4e5..5110dd433b80 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -68,7 +68,12 @@ struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
>  	u16 poison;
> -	struct sgx_encl_page *owner;
> +
> +	/* Possible owner types */
> +	union {
> +		struct sgx_encl_page *encl_page;
> +		struct sgx_encl *encl;
> +	};

Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
belongs to.

Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page, perhaps we
should do below?

	union {
		struct sgx_encl_page *encl_page;
		struct sgx_encl *encl;
		struct sgx_vepc *vepc;
		void *owner;
	};

And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.




^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-27 11:35   ` Huang, Kai
  2023-10-03  6:45     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 11:35 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In a later patch, when a cgroup has exceeded the max capacity for EPC
> pages, it may need to identify and OOM kill a less active enclave to
> make room for other enclaves within the same group. Such a victim
> enclave would have no active pages other than the unreclaimable Version
> Array (VA) and SECS pages.  
> 

What does "no active pages" mean?

A "less active enclave" doesn't necessarily mean it has "no active pages"?


> Therefore, the cgroup needs examine its
			^
			needs to

> unreclaimable page list, and finding an enclave given a SECS page or a
				^
				find

> VA page. This will require a backpointer from a page to an enclave,
> which is not available for VA pages.
> 
> Because struct sgx_epc_page instances of VA pages are not owned by an
> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
> which will store this value in the owner field of the struct
> sgx_epc_page.  
> 

IMHO this paragraph is hard to understand and can be more concise:

One VA page can be shared by multiple enclave pages and thus cannot be
associated with any 'struct sgx_encl_page' instance.  Set the owner of a VA
page to the enclave instead.


> In a later patch, VA pages will be placed in an
> unreclaimable queue that can be examined by the cgroup to select the OOM
> killed enclave.

The code to "place the VA page on the unreclaimable queue" was already done in
an earlier patch ("x86/sgx: Introduce EPC page states").  Just the
unreclaimable list isn't introduced yet.  I think you should introduce it
first; then you can get rid of all that "in a later patch" stuff.

And nit: please use "unreclaimable list" consistently (not queue).


Btw, probably a dumb question:

Theoretically, if you only need to find a victim enclave you don't need to put
VA pages on the unreclaimable list, because those VA pages will be freed anyway
when the enclave is killed.  So keeping VA pages on the list is for accounting
for all the pages that the cgroup holds?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-09-27 11:57   ` Huang, Kai
  2023-10-03  5:42     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 11:57 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> When an OOM event occurs, all pages associated with an enclave will need
> to be freed, including pages that are not currently tracked by the
> cgroup LRU lists.

What are "cgroup LRU lists"?

> 
> Add a new "unreclaimable" list to the sgx_epc_lru_lists struct and
> update the "sgx_record/drop_epc_pages()" functions for adding/removing
> VA and SECS pages to/from this "unreclaimable" list.

Sorry I don't follow the logic between the two paragraphs.

What is the exact problem?  How does the new "unreclaimable" list solve the
problem?



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-27 11:14   ` Huang, Kai
@ 2023-09-27 15:35     ` Haitao Huang
  2023-09-27 21:21       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-09-27 15:35 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

Hi Kai,

On Wed, 27 Sep 2023 06:14:20 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> In a later patch, when a cgroup has exceeded the max capacity for EPC
>> pages, it may need to identify and OOM kill a less active enclave to
>> make room for other enclaves within the same group. Such a victim
>> enclave would have no active pages other than the unreclaimable Version
>> Array (VA) and SECS pages.  Therefore, the cgroup needs examine its
>> unreclaimable page list, and finding an enclave given a SECS page or a
>> VA page. This will require a backpointer from a page to an enclave,
>> which is not available for VA pages.
>>
>> Because struct sgx_epc_page instances of VA pages are not owned by an
>> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
>> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
>> which will store this value in the owner field of the struct
>> sgx_epc_page.  In a later patch, VA pages will be placed in an
>> unreclaimable queue that can be examined by the cgroup to select the OOM
>> killed enclave.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>>
>
> [...]
>
>> @@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>>  	for ( ; ; ) {
>>  		page = __sgx_alloc_epc_page();
>>  		if (!IS_ERR(page)) {
>> -			page->owner = owner;
>> +			page->encl_page = owner;
>
> Looks using 'encl_page' is arbitrary.
>
> Also actually for virtual EPC page the owner is set to the 'sgx_vepc'  
> instance.
>
>>  			break;
>>  		}
>>
>> @@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>>
>>  	spin_lock(&node->lock);
>>
>> -	page->owner = NULL;
>> +	page->encl_page = NULL;
>
> Ditto.
>
>>  	if (page->poison)
>>  		list_add(&page->list, &node->sgx_poison_page_list);
>>  	else
>> @@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>>  	for (i = 0; i < nr_pages; i++) {
>>  		section->pages[i].section = index;
>>  		section->pages[i].flags = 0;
>> -		section->pages[i].owner = NULL;
>> +		section->pages[i].encl_page = NULL;
>>  		section->pages[i].poison = 0;
>>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>>  	}
>> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
>> index 764cec23f4e5..5110dd433b80 100644
>> --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> @@ -68,7 +68,12 @@ struct sgx_epc_page {
>>  	unsigned int section;
>>  	u16 flags;
>>  	u16 poison;
>> -	struct sgx_encl_page *owner;
>> +
>> +	/* Possible owner types */
>> +	union {
>> +		struct sgx_encl_page *encl_page;
>> +		struct sgx_encl *encl;
>> +	};
>
> Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
> belongs to.
>
> Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page, perhaps we
> should do below?
>
> 	union {
> 		struct sgx_encl_page *encl_page;
> 		struct sgx_encl *encl;
> 		struct sgx_vepc *vepc;
> 		void *owner;
> 	};
>
> And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
>

As I mentioned in the cover letter and the changelog of 11/18, this series
does not track virtual EPC.
We can add a vepc field to the union in the future if such tracking is
needed. I don't think "void *owner" is needed, though.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-27 15:35     ` Haitao Huang
@ 2023-09-27 21:21       ` Huang, Kai
  2023-09-29 15:06         ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-27 21:21 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
> > > +
> > > +	/* Possible owner types */
> > > +	union {
> > > +		struct sgx_encl_page *encl_page;
> > > +		struct sgx_encl *encl;
> > > +	};
> > 
> > Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
> > belongs to.
> > 
> > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page, perhaps
> > we should do below?
> > 
> >  	union {
> >  		struct sgx_encl_page *encl_page;
> >  		struct sgx_encl *encl;
> >  		struct sgx_vepc *vepc;
> >  		void *owner;
> >  	};
> > 
> > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
> > 
> 
> As I mentioned in the cover letter and the changelog of 11/18, this series
> does not track virtual EPC.

It's not about what the cover letter says.  We cannot ignore the fact that
currently virtual EPC uses the owner too.

But given that the virtual EPC code currently doesn't use the owner, I can live
with not having the 'vepc' member in the union for now.

> We can add a vepc field to the union in the future if such tracking is
> needed. I don't think "void *owner" is needed, though.

As mentioned, using 'encl_page' arbitrarily in sgx_alloc_epc_page() doesn't look
nice.  Do you have an example in the current kernel code to prove it is
acceptable?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-28  3:59   ` Huang, Kai
  2023-10-03  7:00     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-28  3:59 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
> for the misc controller.
> 
> Add per resource type private data so that SGX can store additional per
> cgroup data in misc_cg->misc_cg_res[MISC_CG_RES_SGX_EPC].

To be honest, I don't quite understand why the above two changes are put in
this patch together with exporting misc_cg_root/parent() below.

Any reason why the above two cannot be done together with the patch ("x86/sgx:
Limit process EPC usage with misc cgroup controller"), where these changes are
actually related?

We all already know that a new EPC misc cgroup will be added.  There's no need
to actually introduce the new type here only to justify exporting some helper
functions.

> 
> Export misc_cg_root() so the SGX driver can initialize and add those
> additional structures to the root misc cgroup as part of initialization
> for EPC cgroup support. This bootstraps the same additional
> initialization for non-root cgroups in the 'alloc()' callback added in the
> previous patch.
> 
> The SGX driver, as the EPC memory provider, will have a background
> worker to reclaim EPC pages to make room for new allocations in the same
> cgroup when its usage counter reaches near the limit controlled by the
> cgroup and its ancestors. Therefore it needs to do a walk from the
> current cgroup up to the root. To enable this walk, move parent_misc()
> into misc_cgroup.h and make inline to make this function available to
> SGX, rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to
> use the new name.

The above two paragraphs look too detailed.  Could we have a more concise
justification for exporting these two functions?

And if it were me, I would put it at a relatively later position (e.g., before
the patch that actually implements the EPC cgroup) for better review.  This
also applies to the first patch.

> 
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V5:
> - Revised commit message (Jarkko)
> 
> V4:
> - Moved this to the second in the series.
> ---
>  include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
>  kernel/cgroup/misc.c        | 25 ++++++++++++-------------
>  2 files changed, 41 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index 96a88822815a..87f29f8597e1 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -17,6 +17,10 @@ enum misc_res_type {
>  	MISC_CG_RES_SEV,
>  	/* AMD SEV-ES ASIDs resource */
>  	MISC_CG_RES_SEV_ES,
> +#endif
> +#ifdef CONFIG_CGROUP_SGX_EPC
> +	/* SGX EPC memory resource */
> +	MISC_CG_RES_SGX_EPC,
>  #endif
>  	MISC_CG_RES_TYPES
>  };
> @@ -37,6 +41,7 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +	void *priv;
>  
>  	/* per resource callback ops */
>  	int (*alloc)(struct misc_cg *cg);
> @@ -59,6 +64,7 @@ struct misc_cg {
>  	struct misc_res res[MISC_CG_RES_TYPES];
>  };
>  
> +struct misc_cg *misc_cg_root(void);
>  u64 misc_cg_res_total_usage(enum misc_res_type type);
>  int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
>  int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
> @@ -78,6 +84,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
>  	return css ? container_of(css, struct misc_cg, css) : NULL;
>  }
>  
> +/**
> + * misc_cg_parent() - Get the parent of the passed misc cgroup.
> + * @cgroup: cgroup whose parent needs to be fetched.
> + *
> + * Context: Any context.
> + * Return:
> + * * struct misc_cg* - Parent of the @cgroup.
> + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
> + */
> +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
> +{
> +	return cgroup ? css_misc(cgroup->css.parent) : NULL;
> +}
> +
>  /*
>   * get_current_misc_cg() - Find and get the misc cgroup of the current task.
>   *
> @@ -102,6 +122,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
>  }
>  
>  #else /* !CONFIG_CGROUP_MISC */
> +static inline struct misc_cg *misc_cg_root(void)
> +{
> +	return NULL;
> +}
> +
> +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
> +{
> +	return NULL;
> +}
>  
>  static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
>  {
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 62c9198dee21..4633b8629e63 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
>  	/* AMD SEV-ES ASIDs resource */
>  	"sev_es",
>  #endif
> +#ifdef CONFIG_CGROUP_SGX_EPC
> +	/* Intel SGX EPC memory bytes */
> +	"sgx_epc",
> +#endif
>  };
>  
>  /* Root misc cgroup */
> @@ -40,18 +44,13 @@ static struct misc_cg root_cg;
>  static u64 misc_res_capacity[MISC_CG_RES_TYPES];
>  
>  /**
> - * parent_misc() - Get the parent of the passed misc cgroup.
> - * @cgroup: cgroup whose parent needs to be fetched.
> - *
> - * Context: Any context.
> - * Return:
> - * * struct misc_cg* - Parent of the @cgroup.
> - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
> + * misc_cg_root() - Return the root misc cgroup.
>   */
> -static struct misc_cg *parent_misc(struct misc_cg *cgroup)
> +struct misc_cg *misc_cg_root(void)
>  {
> -	return cgroup ? css_misc(cgroup->css.parent) : NULL;
> +	return &root_cg;
>  }
> +EXPORT_SYMBOL_GPL(misc_cg_root);
>  
>  /**
>   * valid_type() - Check if @type is valid or not.
> @@ -150,7 +149,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	if (!amount)
>  		return 0;
>  
> -	for (i = cg; i; i = parent_misc(i)) {
> +	for (i = cg; i; i = misc_cg_parent(i)) {
>  		res = &i->res[type];
>  
>  		new_usage = atomic64_add_return(amount, &res->usage);
> @@ -163,12 +162,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	return 0;
>  
>  err_charge:
> -	for (j = i; j; j = parent_misc(j)) {
> +	for (j = i; j; j = misc_cg_parent(j)) {
>  		atomic64_inc(&j->res[type].events);
>  		cgroup_file_notify(&j->events_file);
>  	}
>  
> -	for (j = cg; j != i; j = parent_misc(j))
> +	for (j = cg; j != i; j = misc_cg_parent(j))
>  		misc_cg_cancel_charge(type, j, amount);
>  	misc_cg_cancel_charge(type, i, amount);
>  	return ret;
> @@ -190,7 +189,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	if (!(amount && valid_type(type) && cg))
>  		return;
>  
> -	for (i = cg; i; i = parent_misc(i))
> +	for (i = cg; i; i = misc_cg_parent(i))
>  		misc_cg_cancel_charge(type, i, amount);
>  }
>  EXPORT_SYMBOL_GPL(misc_cg_uncharge);


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 08/18] x86/sgx: Use a list to track to-be-reclaimed pages
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-09-28  9:28   ` Huang, Kai
  2023-10-03  5:09     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-28  9:28 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> @@ -314,18 +313,22 @@ static void sgx_reclaim_pages(void)
>  
>  		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>  			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> -			chunk[cnt++] = epc_page;
> +			list_move_tail(&epc_page->list, &iso);
>  		} else {
> -			/* The owner is freeing the page. No need to add the
> -			 * page back to the list of reclaimable pages.
> +			/* The owner is freeing the page, remove it from the
> +			 * LRU list
>  			 */

Please fix comment style.

>  			sgx_epc_page_reset_state(epc_page);
> +			list_del_init(&epc_page->list);

Is this still needed?  It seems list_del_init() has already been done when the
EPC page is taken off the global active list?


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-09-28  9:41   ` Huang, Kai
  2023-10-03  5:15     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-09-28  9:41 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -746,6 +746,7 @@ void sgx_encl_release(struct kref *ref)
>  	xa_destroy(&encl->page_array);
>  
>  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> +		sgx_drop_epc_page(encl->secs.epc_page);
>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  	}

The "record" of SECS/VA pages should be done together with this.  I see no
reason why the "record" and "drop" are separated into different patches.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-27 21:21       ` Huang, Kai
@ 2023-09-29 15:06         ` Haitao Huang
  2023-10-02 11:05           ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-09-29 15:06 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, tglx, tj, Mehta, Sohil, Huang, Kai
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Wed, 27 Sep 2023 16:21:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
>> > > +
>> > > +	/* Possible owner types */
>> > > +	union {
>> > > +		struct sgx_encl_page *encl_page;
>> > > +		struct sgx_encl *encl;
>> > > +	};
>> >
>> > Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance
>> > it belongs to.
>> >
>> > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,
>> > perhaps we should do below?
>> >
>> >  	union {
>> >  		struct sgx_encl_page *encl_page;
>> >  		struct sgx_encl *encl;
>> >  		struct sgx_vepc *vepc;
>> >  		void *owner;
>> >  	};
>> >
>> > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
>> >
>>
>> As I mentioned in the cover letter and the change log in 11/18, this series does 
>> not track virtual EPC.
>
> It's not about what the cover letter says.  We cannot ignore the fact that
> currently virtual EPC uses owner too.
>
> But given the virtual EPC code currently doesn't use the owner, I can  
> live with
> not having the 'vepc' member in the union now.

Ah, I forgot that even though we don't use the owner field assigned by vepc,
it is still passed into sgx_alloc/free_epc_page().

Will add back "void *owner" and use it for the assignment inside
sgx_alloc/free_epc_page().
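
I.e., roughly the following (just a sketch of the direction, not the
final hunk):

	/* Possible owner types */
	union {
		struct sgx_encl_page *encl_page;
		struct sgx_encl *encl;
		void *owner;
	};

so that sgx_alloc_epc_page(void *owner, ...) can simply do
"page->owner = owner;" regardless of the concrete type.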

Thanks

Haitao


* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-29 15:06         ` Haitao Huang
@ 2023-10-02 11:05           ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-02 11:05 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, haitao.huang, tglx, jarkko, Mehta, Sohil
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie

On Fri, 2023-09-29 at 10:06 -0500, Haitao Huang wrote:
> On Wed, 27 Sep 2023 16:21:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
> > > > > +
> > > > > +	/* Possible owner types */
> > > > > +	union {
> > > > > +		struct sgx_encl_page *encl_page;
> > > > > +		struct sgx_encl *encl;
> > > > > +	};
> > > > 
> > > > Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance
> > > > it belongs to.
> > > > 
> > > > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,
> > > > perhaps we should do below?
> > > > 
> > > >  	union {
> > > >  		struct sgx_encl_page *encl_page;
> > > >  		struct sgx_encl *encl;
> > > >  		struct sgx_vepc *vepc;
> > > >  		void *owner;
> > > >  	};
> > > > 
> > > > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
> > > > 
> > > 
> > > As I mentioned in the cover letter and the change log in 11/18, this series does 
> > > not track virtual EPC.
> > 
> > It's not about what the cover letter says.  We cannot ignore the fact that
> > currently virtual EPC uses owner too.
> > 
> > But given the virtual EPC code currently doesn't use the owner, I can  
> > live with
> > not having the 'vepc' member in the union now.
> 
> Ah, I forgot that even though we don't use the owner field assigned by vepc,
> it is still passed into sgx_alloc/free_epc_page().
> 
> Will add back "void *owner" and use it for the assignment inside
> sgx_alloc/free_epc_page().
> 
> 

And also sgx_setup_epc_section().


* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-27  1:56           ` Haitao Huang
@ 2023-10-02 22:47             ` Jarkko Sakkinen
  2023-10-02 22:55               ` Jarkko Sakkinen
  0 siblings, 1 reply; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-10-02 22:47 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Wed Sep 27, 2023 at 4:56 AM EEST, Haitao Huang wrote:
> On Tue, 26 Sep 2023 08:13:18 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
>
> ...
> >> > >>  /**
> >> > >> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state
> >> > >> *parent_css)
> >> > >>   */
> >> > >>  static void misc_cg_free(struct cgroup_subsys_state *css)
> >> > >>  {
> >> > >> -	kfree(css_misc(css));
> >> > >> +	struct misc_cg *cg = css_misc(css);
> >> > >> +	enum misc_res_type i;
> >> > >> +
> >> > >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> >> > >> +		if (cg->res[i].free)
> >> > >> +			cg->res[i].free(cg);
> >> > >> +
> >> > >> +	kfree(cg);
> >> > >>  }
> >> > >>
> >> > >>  /* Cgroup controller callbacks */
> >> > >> --
> >> > >> 2.25.1
> >> > >
> >> > > Since the only existing client feature requires all callbacks, should
> >> > > this not have that as an invariant?
> >> > >
> >> > > I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
> >> > > catch issues in the kernel code).
> >> > >
> >> >
> >> > These callbacks are chained from cgroup_subsys, and they are defined
> >> > separately so it'd be better to follow the same pattern.  Or change them
> >> > together with cgroup_subsys if we want to do that. Reasonable?
> >>
> >> I noticed this one later:
> >>
> >> It would be better to create a separate ops struct and declare the instance
> >> as const at minimum.
> >>
> >> Then there is no need for dynamic assignment of ops and all of that is in
> >> rodata. This improves both security and also allows static analysis a
> >> bit better.
> >>
> >> Now you have to dynamically trace the struct instance, e.g. in case of
> >> a bug. If this were done, it would already be in the vmlinux.
> > I.e. then in the driver you can have a static const struct declaration
> > with *all* pointers pre-assigned.
> >
> > Not sure if cgroups follows this or not but it is *objectively*
> > better. Previous work is not always best possible work...
> >
>
> IIUC, like the vm_ops field in vma structs. Although function pointers in
> vm_ops are assigned statically, you still need to dynamically assign
> vm_ops for each instance of vma.
>
> So the code will look like this:
>
> if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc)
> {
> ...
> }
>
> I don't see this pattern used in cgroups and have no strong opinion
> either way.
>
> TJ, do you have preference on this?

I do have a strong opinion on this. On the client side we want as many
things declared statically as we can because it gives more tools for
static analysis.

I don't want to see dynamic assignments in the SGX driver, when they
are not actually needed, no matter how things are done in cgroups.

> Thanks
> Haitao

BR, Jarkko


* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-10-02 22:47             ` Jarkko Sakkinen
@ 2023-10-02 22:55               ` Jarkko Sakkinen
  2023-10-04 15:45                 ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Jarkko Sakkinen @ 2023-10-02 22:55 UTC (permalink / raw)
  To: Jarkko Sakkinen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, x86, cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

On Tue Oct 3, 2023 at 1:47 AM EEST, Jarkko Sakkinen wrote:
> On Wed Sep 27, 2023 at 4:56 AM EEST, Haitao Huang wrote:
> > On Tue, 26 Sep 2023 08:13:18 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> > wrote:
> >
> > ...
> > >> > >>  /**
> > >> > >> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state
> > >> > >> *parent_css)
> > >> > >>   */
> > >> > >>  static void misc_cg_free(struct cgroup_subsys_state *css)
> > >> > >>  {
> > >> > >> -	kfree(css_misc(css));
> > >> > >> +	struct misc_cg *cg = css_misc(css);
> > >> > >> +	enum misc_res_type i;
> > >> > >> +
> > >> > >> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> > >> > >> +		if (cg->res[i].free)
> > >> > >> +			cg->res[i].free(cg);
> > >> > >> +
> > >> > >> +	kfree(cg);
> > >> > >>  }
> > >> > >>
> > >> > >>  /* Cgroup controller callbacks */
> > >> > >> --
> > >> > >> 2.25.1
> > >> > >
> > >> > > Since the only existing client feature requires all callbacks, should
> > >> > > this not have that as an invariant?
> > >> > >
> > >> > > I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
> > >> > > catch issues in the kernel code).
> > >> > >
> > >> >
> > >> > These callbacks are chained from cgroup_subsys, and they are defined
> > >> > separately so it'd be better to follow the same pattern.  Or change them
> > >> > together with cgroup_subsys if we want to do that. Reasonable?
> > >>
> > >> I noticed this one later:
> > >>
> > >> It would be better to create a separate ops struct and declare the instance
> > >> as const at minimum.
> > >>
> > >> Then there is no need for dynamic assignment of ops and all of that is in
> > >> rodata. This improves both security and also allows static analysis a
> > >> bit better.
> > >>
> > >> Now you have to dynamically trace the struct instance, e.g. in case of
> > >> a bug. If this were done, it would already be in the vmlinux.
> > > I.e. then in the driver you can have a static const struct declaration
> > > with *all* pointers pre-assigned.
> > >
> > > Not sure if cgroups follows this or not but it is *objectively*
> > > better. Previous work is not always best possible work...
> > >
> >
> > IIUC, like the vm_ops field in vma structs. Although function pointers in
> > vm_ops are assigned statically, you still need to dynamically assign
> > vm_ops for each instance of vma.
> >
> > So the code will look like this:
> >
> > if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc)
> > {
> > ...
> > }
> >
> > I don't see this pattern used in cgroups and have no strong opinion
> > either way.
> >
> > TJ, do you have preference on this?
>
> I do have a strong opinion on this. On the client side we want as many
> things declared statically as we can because it gives more tools for
> static analysis.
>
> I don't want to see dynamic assignments in the SGX driver, when they
> are not actually needed, no matter how things are done in cgroups.

I.e. I don't really even care what crazy things the cgroups subsystem
might or might not do. It's not my problem.

All I care about is that we *do not* have any use for assigning those
pointers at run-time. So do whatever you want with the cgroups side
as long as this is not the case.

BR, Jarkko


* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
  2023-09-27 10:28   ` Huang, Kai
@ 2023-10-03  4:49     ` Haitao Huang
  2023-10-03 20:03       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  4:49 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> Use the lower 3 bits in the flags field of sgx_epc_page struct to
>> track EPC states in its life cycle and define an enum for possible
>> states. More state(s) will be added later.
>
> This patch does more than what the changelog claims to do.  AFAICT it  
> does
> below:
>
>  1) Use the lower 3 bits to track EPC page status
>  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RECLAIMABLE
>  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
>  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
>
> The changelog only says 1) IIUC.
>
I don't quite get why you would view 3) as a separate item from 1).
In my view, 4) is not done as long as there is no separate list to track
it.
Maybe I should make the distinction between "states" and "tracking" clear:
states are just bits in the flags; "tracking" is done using the lists by
ksgxd/cgroup. This patch is really about "states".
Would that clarify the intention of the patch?
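
For reference, the states this patch defines are along these lines (a
sketch; apart from UNRECLAIMABLE = 3 quoted above, the exact values here
are only illustrative):

	enum sgx_epc_page_state {
		/* Not tracked by the reclaimer; becomes so after sgx_drop_epc() */
		SGX_EPC_PAGE_NOT_TRACKED = 0,
		/* Page is on a free list */
		SGX_EPC_PAGE_IS_FREE = 1,
		/* In use and tracked on a reclaimable LRU list */
		SGX_EPC_PAGE_RECLAIMABLE = 2,
		/* In use, only freed on enclave OOM/release, e.g., VA, SECS */
		SGX_EPC_PAGE_UNRECLAIMABLE = 3,
	};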

> If we really want to do all these in one patch, then the changelog  
> should at
> least mention the justification of all of them.
>
> But I don't see why 3) and 4) need to be done here.  Instead, IMHO they  
> should
> be done in a separate patch, and do it after the unreclaimable list is
> introduced (or you need to bring that patch forward).
>
>
> For instance, ...
>
> [snip]
>
>> +
>> +	/* Page is in use but tracked in an unreclaimable LRU list. These are
>> +	 * only reclaimable when the whole enclave is OOM killed or the  
>> enclave
>> +	 * is released, e.g., VA, SECS pages
>> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
>> +	 */
>> +	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>
> ... We even don't have the unreclaimable LRU list yet.  It's odd to have  
> this
> comment here.
>

Yeah, I should take out the mention of the LRUs from the definitions of the
states.

Thanks
Haitao


* Re: [PATCH v5 08/18] x86/sgx: Use a list to track to-be-reclaimed pages
  2023-09-28  9:28   ` Huang, Kai
@ 2023-10-03  5:09     ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  5:09 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Thu, 28 Sep 2023 04:28:34 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> @@ -314,18 +313,22 @@ static void sgx_reclaim_pages(void)
>>   		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>>  			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>> -			chunk[cnt++] = epc_page;
>> +			list_move_tail(&epc_page->list, &iso);
>>  		} else {
>> -			/* The owner is freeing the page. No need to add the
>> -			 * page back to the list of reclaimable pages.
>> +			/* The owner is freeing the page, remove it from the
>> +			 * LRU list
>>  			 */
>
> Please fix comment style.

Sure
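
For reference, the fixed comment will use the normal kernel multi-line
style, i.e.:

		/*
		 * The owner is freeing the page, remove it from the
		 * LRU list.
		 */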

>
>>  			sgx_epc_page_reset_state(epc_page);
>> +			list_del_init(&epc_page->list);
>
> Is this still needed?  It seems list_del_init() has been done when the  
> EPC page
> is taken off from the global active list?
>

Good catch. I'll remove it.

Thanks
Haitao


* Re: [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists
  2023-09-28  9:41   ` Huang, Kai
@ 2023-10-03  5:15     ` Haitao Huang
  2023-10-03 20:12       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  5:15 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Thu, 28 Sep 2023 04:41:33 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>> @@ -746,6 +746,7 @@ void sgx_encl_release(struct kref *ref)
>>  	xa_destroy(&encl->page_array);
>>
>>  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
>> +		sgx_drop_epc_page(encl->secs.epc_page);
>>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>>  		encl->secs.epc_page = NULL;
>>  	}
>
> The "record" of SECS/VA pages should be done together with this.  I see  
> no
> reason why the "record" and "drop" are separated into different patches.

"record" of SECS/VA pages are done in this patch. Before nothing done in  
"record" for them because no tracking LRU lists for them. Now they are  
tracked.
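
For reference, the two ends being discussed, as they appear in the quoted
hunks:

	/* when the SECS page is created */
	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_UNRECLAIMABLE);

	/* when the enclave is released (this patch) */
	sgx_drop_epc_page(encl->secs.epc_page);
	sgx_encl_free_epc_page(encl->secs.epc_page);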

Thanks
Haitao


* Re: [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists
  2023-09-27 11:57   ` Huang, Kai
@ 2023-10-03  5:42     ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  5:42 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 27 Sep 2023 06:57:18 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> When an OOM event occurs, all pages associated with an enclave will need
>> to be freed, including pages that are not currently tracked by the
>> cgroup LRU lists.
>
> What are "cgroup LRU lists"?
>
Will reword it. At the moment there is only one global sgx_epc_lru_lists.
>>
>> Add a new "unreclaimable" list to the sgx_epc_lru_lists struct and
>> update the "sgx_record/drop_epc_pages()" functions for adding/removing
>> VA and SECS pages to/from this "unreclaimable" list.
>
> Sorry I don't follow the logic between the two paragraphs.
>
> What is the exact problem?  How does the new "unreclaimable" list solve  
> the
> problem?
>
>
Currently they are not tracked in any list managed by ksgxd (or the future
cgroup worker), so add a list to track them.
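
Roughly (a sketch of the struct; the lock is an implementation detail of
the series, shown here only for illustration):

	struct sgx_epc_lru_lists {
		spinlock_t lock;
		struct list_head reclaimable;
		struct list_head unreclaimable;
	};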
Thanks
Haitao


* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-09-27 11:35   ` Huang, Kai
@ 2023-10-03  6:45     ` Haitao Huang
  2023-10-03 20:07       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  6:45 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 27 Sep 2023 06:35:57 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> In a later patch, when a cgroup has exceeded the max capacity for EPC
>> pages, it may need to identify and OOM kill a less active enclave to
>> make room for other enclaves within the same group. Such a victim
>> enclave would have no active pages other than the unreclaimable Version
>> Array (VA) and SECS pages.
>
> What does "no active pages" mean?
>

EPC pages in use.

> A "less active enclave" doesn't necessarily mean it has "no active  
> pages"?
>

I'll rephrase the above sentences.

>
>> Therefore, the cgroup needs examine its
> 			^
> 			needs to
>
>> unreclaimable page list, and finding an enclave given a SECS page or a
> 				^
> 				find
>
>> VA page. This will require a backpointer from a page to an enclave,
>> which is not available for VA pages.
>>
>> Because struct sgx_epc_page instances of VA pages are not owned by an
>> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
>> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
>> which will store this value in the owner field of the struct
>> sgx_epc_page.
>
> IMHO this paragraph is hard to understand and can be more concise:
>
> One VA page can be shared by multiple enclave pages thus cannot be  
> associated
> with any 'struct sgx_encl_page' instance.  Set the owner of VA page to  
> the
> enclave instead.
>
>

Agreed

>> In a later patch, VA pages will be placed in an
>> unreclaimable queue that can be examined by the cgroup to select the OOM
>> killed enclave.
>
> The code to "place the VA page to unreclaimable queue" has been done in  
> earlier
> patch ("x86/sgx: Introduce EPC page states").  Just the unreclaimable  
> list isn't
> introduced yet.  I think you should just introduce it first then you can  
> get rid
> of those "in a later patch" staff.
>

I hope I was able to clarify to you in other threads that VA pages are not  
placed in any queue/list until [PATCH v5 11/18] x86/sgx: store  
unreclaimable pages in LRU lists.

This patch is the first one to implement tracking for unreclaimable pages.  
I'll add that as a transition hint.

> And nit: please use "unreclaimable list" consistently (not queue).
>

Yes will do

>
> Btw, probably a dumb question:
>
> Theoretically if you only need to find a victim enclave you don't need  
> to put VA
> pages to the unreclaimable list, because those VA pages will be freed  
> anyway
> when enclave is killed.  So keeping VA pages in the list is for  
> accounting all
> the pages that the cgroup is having?

Yes, basically tracking them in cgroups as they are allocated.

VAs and SECS may also come and go as swapping/unswapping happens. But if a
cgroup is OOM, and all reclaimables are gone (swapped out), it'd have to
reclaim VAs/SECS in the same cgroup starting from the front of the LRU
list. To reclaim a VA/SECS page, it identifies the enclave from the owner of
the VA/SECS page and kills it, as killing the enclave is the only way to
reclaim VA/SECS pages.
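
In pseudo-code, the OOM path looks roughly like this (the helper names are
illustrative, not the actual ones in the series):

	/* no reclaimable pages left, usage still at/near the limit */
	page = list_first_entry(&lru->unreclaimable,
				struct sgx_epc_page, list);
	encl = sgx_epc_page_to_encl(page);	/* via the owner union */
	sgx_epc_oom_kill(encl);			/* frees its VA/SECS pages */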


* Re: [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  2023-09-28  3:59   ` Huang, Kai
@ 2023-10-03  7:00     ` Haitao Huang
  2023-10-03 19:33       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-03  7:00 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 27 Sep 2023 22:59:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
>> for the misc controller.
>>
>> Add per resource type private data so that SGX can store additional per
>> cgroup data in misc_cg->misc_cg_res[MISC_CG_RES_SGX_EPC].
>
> To be honest I don't quite understand why you put the above two changes
> in this
> patch together with exporting misc_cg_root/parent() below.
>
> Any reason why the above two cannot be done together with patch ("  
> x86/sgx:
> Limit process EPC usage with misc cgroup controller"), where these  
> changes are
> actually related?
>
> We all already know that a new EPC misc cgroup will be added.  There's  
> no need
> to actually introduce the new type here only to justify exporting some  
> helper
> functions.
>

I think previous authors intended to separate all prerequisite misc  
changes from SGX changes.
I can combine them if maintainers are fine with it.
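
For context, the 'priv' field being added is what lets the SGX side get
back to its per-cgroup state from a struct misc_cg, roughly:

	struct sgx_epc_cgroup *epc_cg = cg->res[MISC_CG_RES_SGX_EPC].priv;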

>>
>> Export misc_cg_root() so the SGX driver can initialize and add those
>> additional structures to the root misc cgroup as part of initialization
>> for EPC cgroup support. This bootstraps the same additional
>> initialization for non-root cgroups in the 'alloc()' callback added in  
>> the
>> previous patch.
>>
>> The SGX driver, as the EPC memory provider, will have a background
>> worker to reclaim EPC pages to make room for new allocations in the same
>> cgroup when its usage counter reaches near the limit controlled by the
>> cgroup and its ancestors. Therefore it needs to do a walk from the
>> current cgroup up to the root. To enable this walk, move parent_misc()
>> into misc_cgroup.h and make inline to make this function available to
>> SGX, rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to
>> use the new name.
>
> Looks like too many details in the above two paragraphs.  Could we have a more
> concise justification for exporting these two functions?
>

This was added to address Jarkko's question, "why does SGX driver need to  
do iterative walks?"
See: https://lore.kernel.org/all/CVHOU5G1SCUT.RCBVZ3W8G2NJ@suppilovahvero/
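
The walk itself is just the loop already used in misc.c, e.g.:

	for (i = cg; i; i = misc_cg_parent(i)) {
		/*
		 * Compare usage against the limit at each ancestor and
		 * reclaim EPC in its subtree when near the limit.
		 */
	}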

> And if it were me, I would put it at a relatively later position (e.g.,  
> before
> the patch actually implements EPC cgroup) for better review.  This also  
> applies
> to the first patch.
>

I was told to move all prerequisites to the front or separate them out.

https://lore.kernel.org/linux-sgx/CU4H43P3H35X.1BCA3CE4D1250@seitikki/




* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-27  9:20   ` Huang, Kai
@ 2023-10-03 14:29     ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-03 14:29 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 27 Sep 2023 04:20:55 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> The misc cgroup controller (subsystem) currently does not perform
>> resource type specific action for Cgroups Subsystem State (CSS) events:
>> the 'css_alloc' event when a cgroup is created and the 'css_free' event
>> when a cgroup is destroyed, or in event of user writing the max value to
>> the misc.max file to set the usage limit of a specific resource
>> [admin-guide/cgroup-v2.rst, 5-9. Misc].
>>
>> Define callbacks for those events and allow resource providers to
>> register the callbacks per resource type as needed. This will be
>> utilized later by the EPC misc cgroup support implemented in the SGX
>> driver:
>> - On css_alloc, allocate and initialize necessary structures for EPC
>> reclaiming, e.g., LRU list, work queue, etc.
>> - On css_free, cleanup and free those structures created in alloc.
>> - On max_write, trigger EPC reclaiming if the new limit is at or below
>> current usage.
>
> Nit:
>
> Wondering why we should trigger EPC reclaiming if the new limit is *at*  
> current
> usage?
>
> I actually don't quite care about why here, but writing these details in  
> the
> changelog may bring unnecessary confusion.  I guess you can just remove  
> all the
> details about what SGX driver needs to do on these callbacks.
>
>
Okay, I'll remove the three bullets on the SGX driver implementation
details.

Thanks
Haitao


* Re: [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  2023-10-03  7:00     ` Haitao Huang
@ 2023-10-03 19:33       ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-03 19:33 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Tue, 2023-10-03 at 02:00 -0500, Haitao Huang wrote:
> On Wed, 27 Sep 2023 22:59:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > 
> > > Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
> > > for the misc controller.
> > > 
> > > Add per resource type private data so that SGX can store additional per
> > > cgroup data in misc_cg->misc_cg_res[MISC_CG_RES_SGX_EPC].
> > 
> > To be honest I don't quite understand why you put the above two changes  
> > in this
> > patch together with exporting misc_cg_root/parent() below.
> > 
> > Any reason why the above two cannot be done together with patch ("  
> > x86/sgx:
> > Limit process EPC usage with misc cgroup controller"), where these  
> > changes are
> > actually related?
> > 
> > We all already know that a new EPC misc cgroup will be added.  There's  
> > no need
> > to actually introduce the new type here only to justify exporting some  
> > helper
> > functions.
> > 
> 
> I think previous authors intended to separate all prerequisite misc  
> changes from SGX changes.
> I can combine them if maintainers are fine with it.

That's fine.  But IMHO for this particular one I think you are mixing things
together: adding the SGX EPC resource type and exporting the APIs have no
dependency on each other in the code.

It will be easier to review if you separate these two parts out.  For instance,
at least it's not super clear whether adding a 'priv' is the right move here w/o
looking at the later patches.

Also if you take a look at:

7aef27f0b2a8 ("svm/sev: Register SEV and SEV-ES ASIDs to the misc controller")

Adding the resource type is added together with the implementation.

So I have no problem if you want to split "adding SGX EPC resource type" out
as a separate patch, but this patch looks like it should be split.

> 
> > > 
> > > Export misc_cg_root() so the SGX driver can initialize and add those
> > > additional structures to the root misc cgroup as part of initialization
> > > for EPC cgroup support. This bootstraps the same additional
> > > initialization for non-root cgroups in the 'alloc()' callback added in  
> > > the
> > > previous patch.
> > > 
> > > The SGX driver, as the EPC memory provider, will have a background
> > > worker to reclaim EPC pages to make room for new allocations in the same
> > > cgroup when its usage counter reaches near the limit controlled by the
> > > cgroup and its ancestors. Therefore it needs to do a walk from the
> > > current cgroup up to the root. To enable this walk, move parent_misc()
> > > into misc_cgroup.h and make inline to make this function available to
> > > SGX, rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to
> > > use the new name.
> > 
> > Looks like too many details in the above two paragraphs.  Could we have a more
> > concise justification for exporting these two functions?
> > 
> 
> This was added to address Jarkko's question, "why does SGX driver need to  
> do iterative walks?"
> See: https://lore.kernel.org/all/CVHOU5G1SCUT.RCBVZ3W8G2NJ@suppilovahvero/

Agree with Jarkko that we need a justification (that's what I said above too).  What I
am saying is you can make it more concise.  I can try to do it if you want me to.

> 
> > And if it were me, I would put it at a relatively later position (e.g.,  
> > before
> > the patch actually implements EPC cgroup) for better review.  This also  
> > applies
> > to the first patch.
> > 
> 
> I was told to move all prerequisites to the front or separate out.
> 
> https://lore.kernel.org/linux-sgx/CU4H43P3H35X.1BCA3CE4D1250@seitikki/
> 
> 

I don't see any conflict.  Please see the first reply.



* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
  2023-10-03  4:49     ` Haitao Huang
@ 2023-10-03 20:03       ` Huang, Kai
  2023-10-04 15:24         ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-03 20:03 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Mon, 2023-10-02 at 23:49 -0500, Haitao Huang wrote:
> On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > Use the lower 3 bits in the flags field of sgx_epc_page struct to
> > > track EPC states in its life cycle and define an enum for possible
> > > states. More state(s) will be added later.
> > 
> > This patch does more than what the changelog claims to do.  AFAICT it  
> > does
> > below:
> > 
> >  1) Use the lower 3 bits to track EPC page status
> >  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RECLAIMABLE
> >  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
> >  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
> > 
> > The changelog only says 1) IIUC.
> > 
> I don't quite get why you would view 3) as a separate item from 1).

1) is about using some method to track EPC page status, 3) is adding a new
state.

Why cannot they be separated?

> In my view, 4) is not done as long as there is no separate list to track  
> it.

You are literally doing below:

@@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct
sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
+	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
+

Which obviously is tracking SECS as an unreclaimable page here.

The only thing you are not doing now is putting it on the actual list, which you
introduce in a later patch.

But why not just do them together?





* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-10-03  6:45     ` Haitao Huang
@ 2023-10-03 20:07       ` Huang, Kai
  2023-10-04 15:03         ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-03 20:07 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > 
> > Btw, probably a dumb question:
> > 
> > Theoretically if you only need to find a victim enclave you don't need  
> > to put VA
> > pages to the unreclaimable list, because those VA pages will be freed  
> > anyway
> > when enclave is killed.  So keeping VA pages in the list is for  
> > accounting all
> > the pages that the cgroup is having?
> 
> Yes, basically tracking them in cgroups as they are allocated.
> 
> VAs and SECS may also come and go as swapping/unswapping happens. But if a
> cgroup is OOM, and all reclaimables are gone (swapped out), it'd have to
> reclaim VAs/SECS in the same cgroup starting from the front of the LRU
> list. To reclaim a VA/SECS page, it identifies the enclave from the owner of
> the VA/SECS page and kills it, as killing the enclave is the only way to
> reclaim VA/SECS pages.

To kill an enclave you just need to track the SECS on the unreclaimable list.

Only when you want to account for the total EPC pages via some list do you
_probably_ need to track VA pages as well.  But I am not quite sure about this
either.


* Re: [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists
  2023-10-03  5:15     ` Haitao Huang
@ 2023-10-03 20:12       ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-03 20:12 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Tue, 2023-10-03 at 00:15 -0500, Haitao Huang wrote:
> On Thu, 28 Sep 2023 04:41:33 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > > @@ -746,6 +746,7 @@ void sgx_encl_release(struct kref *ref)
> > >  	xa_destroy(&encl->page_array);
> > > 
> > >  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> > > +		sgx_drop_epc_page(encl->secs.epc_page);
> > >  		sgx_encl_free_epc_page(encl->secs.epc_page);
> > >  		encl->secs.epc_page = NULL;
> > >  	}
> > 
> > The "record" of SECS/VA pages should be done together with this.  I see  
> > no
> > reason why the "record" and "drop" are separated into different patches.
> 
> "record" of SECS/VA pages are done in this patch. Before nothing done in  
> "record" for them because no tracking LRU lists for them. Now they are  
> tracked.
> 
> 

I was talking about calling sgx_record_epc_page() for SECS/VA:

@@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct
sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
+	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);

This piece of code *literally* does the record.





* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-10-03 20:07       ` Huang, Kai
@ 2023-10-04 15:03         ` Haitao Huang
  2023-10-04 21:13           ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-04 15:03 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, tglx, tj, Mehta, Sohil, Huang, Kai
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
>> >
>> > Btw, probably a dumb question:
>> >
>> > Theoretically if you only need to find a victim enclave you don't need 
>> > to put VA
>> > pages to the unreclaimable list, because those VA pages will be freed 
>> > anyway
>> > when enclave is killed.  So keeping VA pages in the list is for
>> > accounting all
>> > the pages that the cgroup is having?
>>
>> Yes, basically tracking them in cgroups as they are allocated.
>>
>> VAs and SECS may also come and go as swapping/unswapping happens. But
>> if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
>> have to reclaim VAs/SECS in the same cgroup starting from the front of
>> the LRU list. To reclaim a VA/SECS page, it identifies the enclave from
>> the owner of the VA/SECS page and kills it, as killing the enclave is
>> the only way to reclaim VA/SECS pages.
>
> To kill an enclave you just need to track the SECS on the unreclaimable list.
> Only when you want to account for the total EPC pages via some list do you
> _probably_
> need to track VA pages as well.  But I am not quite sure about this either.

There is a case where even the SECS is paged out for an enclave with all
reclaimables out. So the cgroup needs to track each page used by an enclave
and kill the enclave when the cgroup needs to lower usage by evicting a VA or
SECS page.
There was some discussion on paging out VAs without killing enclaves, but
it'd be complicated and is not implemented yet.

BTW, I need to clarify tracking of pages, which is done by LRUs, vs usage
accounting, which is done by charge/uncharge to misc. To me, tracking is for
reclaiming, not accounting. Also, vEPCs are not tracked at all but they are
accounted for.

Haitao


* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
  2023-10-03 20:03       ` Huang, Kai
@ 2023-10-04 15:24         ` Haitao Huang
  2023-10-04 21:05           ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-04 15:24 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, tglx, tj, Mehta, Sohil, Huang, Kai
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Tue, 03 Oct 2023 15:03:48 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-02 at 23:49 -0500, Haitao Huang wrote:
>> On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > Use the lower 3 bits in the flags field of sgx_epc_page struct to
>> > > track EPC states in its life cycle and define an enum for possible
>> > > states. More state(s) will be added later.
>> >
>> > This patch does more than what the changelog claims to do.  AFAICT it
>> > does
>> > below:
>> >
>> >  1) Use the lower 3 bits to track EPC page status
> > >  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RECLAIMABLE
>> >  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
>> >  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
>> >
>> > The changelog only says 1) IIUC.
>> >
>> I don't quite get why you would view 3) as a separate item from 1).
>
> 1) is about using some method to track EPC page status, 3) is adding a  
> new
> state.
>
> Why cannot they be separated?
>
> > In my view, 4) is not done as long as there is no separate list to  
>> track
>> it.
>
> You are literally doing below:
>
> @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl,  
> struct
> sgx_secs *secs)
>  	encl->attributes = secs->attributes;
>  	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
> +	sgx_record_epc_page(encl->secs.epc_page,
> +			    SGX_EPC_PAGE_UNRECLAIMABLE);
> +
>
> Which obviously is tracking SECS as an unreclaimable page here.
>
> The only thing you are not doing now is putting it on the actual list, which
> you
> introduce in a later patch.
>
> But why not just do them together?
>
>
I see where the problem is now.  Initially these states were bit masks, so
UNTRACKED and UNRECLAIMABLE were both unmasked (set to zero). I'll change
these "record" calls to use UNTRACKED instead, and later replace that with
UNRECLAIMABLE when the pages are actually added to the list. So the
UNRECLAIMABLE state can also be delayed until the patch that adds the list.
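
I.e., in this patch the calls would become (a sketch; exact state name
aside):

	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_UNTRACKED);

and only the later patch that adds the unreclaimable list switches them to:

	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_UNRECLAIMABLE);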

Thanks.
Haitao


* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-10-02 22:55               ` Jarkko Sakkinen
@ 2023-10-04 15:45                 ` Haitao Huang
  2023-10-04 17:18                   ` Tejun Heo
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-04 15:45 UTC (permalink / raw)
  To: Jarkko Sakkinen, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen, yangjie

Hi Jarkko

On Mon, 02 Oct 2023 17:55:14 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:
...
>> > >> I noticed this one later:
>> > >>
>> > >> It would be better to create a separate ops struct and declare the
>> > >> instance as const at minimum.
>> > >>
>> > >> Then there is no need for dynamic assignment of ops and all of that is
>> > >> in rodata. This improves both security and also allows static analysis
>> > >> a bit better.
>> > >>
>> > >> Now you have to dynamically trace the struct instance, e.g. in case of
>> > >> a bug. If this were done, it would already be in the vmlinux.
>> > > I.e. then in the driver you can have a static const struct declaration
>> > > with *all* pointers pre-assigned.
>> > >
>> > > Not sure if cgroups follows this or not but it is *objectively*
>> > > better. Previous work is not always best possible work...
>> > >
>> >
>> > IIUC, like the vm_ops field in vma structs. Although function pointers in
>> > vm_ops are assigned statically, you still need to dynamically assign
>> > vm_ops for each instance of vma.
>> >
>> > So the code will look like this:
>> >
>> > if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc)
>> > {
>> > ...
>> > }
>> >
>> > I don't see this pattern used in cgroups and have no strong opinion
>> > either way.
>> >
>> > TJ, do you have preference on this?
>>
>> I do have a strong opinion on this. On the client side we want as many
>> things declared statically as we can because it gives more tools for
>> static analysis.
>>
>> I don't want to see dynamic assignments in the SGX driver, when they
>> are not actually needed, no matter how things are done in cgroups.
>
> I.e. I don't really even care what crazy things the cgroups subsystem
> might or might not do. It's not my problem.
>
> All I care is that we *do not* have any use for assigning those
> pointers at run-time. So do whatever you want with cgroups side
> as long as this is not the case.
>


So I will update to something like the following. Let me know if that's a
correct understanding.
@tj, I'd appreciate your input on whether this is acceptable from the
cgroups side.

--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -31,22 +31,26 @@ struct misc_cg;

  #include <linux/cgroup.h>

+/* per resource callback ops */
+struct misc_operations_struct {
+       int (*alloc)(struct misc_cg *cg);
+       void (*free)(struct misc_cg *cg);
+       void (*max_write)(struct misc_cg *cg);
+};
  /**
   * struct misc_res: Per cgroup per misc type resource
   * @max: Maximum limit on the resource.
   * @usage: Current usage of the resource.
   * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
+ * @misc_ops: resource specific operations.
   */
  struct misc_res {
         u64 max;
         atomic64_t usage;
         atomic64_t events;
         void *priv;
-
-       /* per resource callback ops */
-       int (*alloc)(struct misc_cg *cg);
-       void (*free)(struct misc_cg *cg);
-       void (*max_write)(struct misc_cg *cg);
+       const struct misc_operations_struct *misc_ops;
  };

...
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4633b8629e63..500415087643 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -277,8 +277,8 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,

         if (READ_ONCE(misc_res_capacity[type])) {
                 WRITE_ONCE(cg->res[type].max, max);
-               if (cg->res[type].max_write)
-                       cg->res[type].max_write(cg);
+               if (cg->res[type].misc_ops && cg->res[type].misc_ops->max_write)
+                       cg->res[type].misc_ops->max_write(cg);

[skip other similar changes in misc.c]

And on SGX side, it'll be updated like this:

--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -376,6 +376,14 @@ static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
         queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
  }

+static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_operations_struct sgx_epc_cgroup_ops = {
+        .alloc = sgx_epc_cgroup_alloc,
+        .free = sgx_epc_cgroup_free,
+        .max_write = sgx_epc_cgroup_max_write,
+};
+
  static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
  {
         struct sgx_epc_cgroup *epc_cg;
@@ -386,12 +394,9 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)

         sgx_lru_init(&epc_cg->lru);
         INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
-       cg->res[MISC_CG_RES_SGX_EPC].alloc = sgx_epc_cgroup_alloc;
-       cg->res[MISC_CG_RES_SGX_EPC].free = sgx_epc_cgroup_free;
-       cg->res[MISC_CG_RES_SGX_EPC].max_write = sgx_epc_cgroup_max_write;
-       cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
-       epc_cg->cg = cg;
-
+       cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
+       cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+       epc_cg->cg = cg;
         return 0;
  }


Thanks again to all of you for the feedback.

Haitao


* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-10-04 15:45                 ` Haitao Huang
@ 2023-10-04 17:18                   ` Tejun Heo
  0 siblings, 0 replies; 144+ messages in thread
From: Tejun Heo @ 2023-10-04 17:18 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Jarkko Sakkinen, dave.hansen, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen,
	seanjc, zhanb, anakrish, mikko.ylinen, yangjie

Hello,

On Wed, Oct 04, 2023 at 10:45:08AM -0500, Haitao Huang wrote:
> So I will update to something like the following. Let me know if that's a
> correct understanding.
> @tj, I'd appreciate your input on whether this is acceptable from the
> cgroups side.

Yeah, that's fine by me and I can't tell what actual differences the two
would have in practice.

Thanks.

-- 
tejun


* Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states
  2023-10-04 15:24         ` Haitao Huang
@ 2023-10-04 21:05           ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-04 21:05 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, haitao.huang, tglx, jarkko, Mehta, Sohil
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie

On Wed, 2023-10-04 at 10:24 -0500, Haitao Huang wrote:
> On Tue, 03 Oct 2023 15:03:48 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-10-02 at 23:49 -0500, Haitao Huang wrote:
> > > On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > Use the lower 3 bits in the flags field of sgx_epc_page struct to
> > > > > track EPC states in its life cycle and define an enum for possible
> > > > > states. More state(s) will be added later.
> > > > 
> > > > This patch does more than what the changelog claims to do.  AFAICT it
> > > > does
> > > > below:
> > > > 
> > > >  1) Use the lower 3 bits to track EPC page status
> > > >  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RECLAIMABLE
> > > >  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
> > > >  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
> > > > 
> > > > The changelog only says 1) IIUC.
> > > > 
> > > I don't quite get why you would view 3) as a separate item from 1).
> > 
> > 1) is about using some method to track EPC page status, 3) is adding a  
> > new
> > state.
> > 
> > Why cannot they be separated?
> > 
> > > In my view, 4) is not done as long as there is no separate list to  
> > > track
> > > it.
> > 
> > You are literally doing below:
> > 
> > @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl,  
> > struct
> > sgx_secs *secs)
> >  	encl->attributes = secs->attributes;
> >  	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
> > +	sgx_record_epc_page(encl->secs.epc_page,
> > +			    SGX_EPC_PAGE_UNRECLAIMABLE);
> > +
> > 
> > Which obviously is tracking SECS as an unreclaimable page here.
> > 
> > The only thing you are not doing now is putting it on the actual list, which
> > you
> > introduce in a later patch.
> > 
> > But why not just do them together?
> > 
> > 
> I see where the problem is now.  Initially these states were bit masks, so
> UNTRACKED and UNRECLAIMABLE were both unmasked (set to zero). I'll change
> these "record" calls to use UNTRACKED instead, and later replace that with
> UNRECLAIMABLE when the pages are actually added to the list. So the
> UNRECLAIMABLE state can also be delayed until the patch that adds the list.

I am not sure whether I am following, but could we just delay introducing the
"untracked" or "unreclaimable" until the list is added?

Why do we need to call sgx_record_epc_page() for SECS and VA pages in _this_
patch?

Reading again, I _think_ the reason why you added these new states is that you
want to justify using the low 3 bits as EPC page states, i.e., the code below ...

	+#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)

But for now we only have two valid states:

	- SGX_EPC_PAGE_IS_FREE
	- SGX_EPC_PAGE_RECLAIMER_TRACKED

Thus you added two more states: NOT_TRACKED/UNRECLAIMABLE.  And more
confusingly, you added calls to sgx_record_epc_page() for SECS and VA pages in
this patch to try to actually use these new states, while the changelog says:

	Use the lower 3 bits in the flags field of sgx_epc_page struct to
	track EPC states in its life cycle and define an enum for possible
	states. More state(s) will be added later.

... which doesn't mention any of above.

But this doesn't stand either, because you only need 2 bits for the four states,
not 3.  So I don't see how adding the new states could help here.

So I would suggest two options:

1) 

In this patch, you only change the way to track EPC states, to match the
changelog of this patch (maybe you can add NOT_TRACKED, but I am not sure /
don't care, just give a justification if you do).

And then you have a patch to introduce the new unreclaimable list, the new EPC
state, and call sgx_record_epc_page() *AND* sgx_drop_epc_page() for SECS/VA
pages.  The patch could be titled something like:

	x86/sgx: Store SECS/VA pages in the unreclaimable list

2)

You do the opposite of option 1): Introduce the patch 

	x86/sgx: Store SECS/VA pages in the unreclaimable list

... first, and then convert the EPC states to using the lower 3 bits (actually 2
bits are enough; you can extend to 3 bits in a later patch when needed).

Does the above make more sense?



* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-10-04 15:03         ` Haitao Huang
@ 2023-10-04 21:13           ` Huang, Kai
  2023-10-05  4:22             ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-04 21:13 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, haitao.huang, tglx, jarkko, Mehta, Sohil
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie

On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
> On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > > > 
> > > > Btw, probably a dumb question:
> > > > 
> > > > Theoretically if you only need to find a victim enclave you don't need 
> > > > to put VA
> > > > pages to the unreclaimable list, because those VA pages will be freed 
> > > > anyway
> > > > when enclave is killed.  So keeping VA pages in the list is for
> > > > accounting all
> > > > the pages that the cgroup is having?
> > > 
> > > Yes, basically tracking them in cgroups as they are allocated.
> > >
> > > VAs and SECS may also come and go as swapping/unswapping happens. But
> > > if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
> > > have to reclaim VAs/SECS in the same cgroup starting from the front of
> > > the LRU list. To reclaim a VA/SECS page, it identifies the enclave from
> > > the owner of the VA/SECS page and kills it, as killing the enclave is
> > > the only way to reclaim VA/SECS pages.
> > 
> > To kill an enclave you just need to track the SECS on the unreclaimable list.
> > Only when you want to account for the total EPC pages via some list do you
> > _probably_
> > need to track VA pages as well.  But I am not quite sure about this either.
> 
> There is a case where even the SECS is paged out for an enclave with all
> reclaimables out.
> 

Yes.  But this essentially means these enclaves are not active, thus shouldn't
be the victim of OOM?

> So the cgroup needs to track each page used by an enclave
> and kill the enclave when the cgroup needs to lower usage by evicting a VA or
> SECS page.

Let's discuss more on tracking SECS on the unreclaimable list only.

Could we assume that when the OOM code wants to pick a victim to serve the new
enclave, there must be at least one other *active* enclave which still has its
SECS page in EPC?

If yes, that enclave will be selected as victim.

If not, then no other enclave will be selected as victim.  Instead, only the new
enclave which is requesting more EPC will be selected, because its SECS is on
the unreclaimable list.

Somehow this is unacceptable, thus we need to track VA pages too in order to
kill other inactive enclaves?

> There was some discussion on paging out VAs without killing enclaves, but
> it'd be complicated and is not implemented yet.

No, we don't involve swapping VA pages now.  It's a separate topic.

> 
> BTW, I need to clarify tracking of pages, which is done by LRUs, vs usage
> accounting, which is done by charge/uncharge to misc. To me, tracking is for
> reclaiming, not accounting. Also, vEPCs are not tracked at all but they are
> accounted for.

I'll review the rest of the patches.  Thanks.


* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-10-04 21:13           ` Huang, Kai
@ 2023-10-05  4:22             ` Haitao Huang
  2023-10-05  6:49               ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-05  4:22 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, tglx, jarkko, Mehta, Sohil, Huang, Kai
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie

On Wed, 04 Oct 2023 16:13:41 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
>> On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
>> > > >
>> > > > Btw, probably a dumb question:
>> > > >
>> > > > Theoretically if you only need to find a victim enclave you don't need
>> > > > to put VA
>> > > > pages to the unreclaimable list, because those VA pages will be freed
>> > > > anyway
>> > > > when enclave is killed.  So keeping VA pages in the list is for
>> > > > accounting all
>> > > > the pages that the cgroup is having?
>> > >
>> > > Yes basically tracking them in cgroups as they are allocated.
>> > >
>> > > VAs and SECS may also come and go as swapping/unswapping happens. But
>> > > if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
>> > > have to reclaim VAs/SECS in the same cgroup starting from the front of
>> > > the LRU list. To reclaim a VA/SECS, it identifies the enclave from the
>> > > owner of the VA/SECS page and kills it, as killing the enclave is the
>> > > only way to reclaim VA/SECS pages.
>> >
>> > To kill an enclave you just need to track SECS in the unreclaimable
>> > list.  Only when you want to account the total EPC pages via some list
>> > do you _probably_ need to track VA as well.  But I am not quite sure
>> > about this either.
>>
>> There is a case where even SECS is paged out for an enclave with all
>> reclaimables out.
>
> Yes.  But this essentially means these enclaves are not active, thus  
> shouldn't
> be the victim of OOM?
>

But there are VA pages for the enclave at that moment, so it can be a
candidate OOM victim.

>> So the cgroup needs to track each page used by an enclave,
>> and kill the enclave when the cgroup needs to lower usage by evicting a
>> VA or SECS page.
>
> Let's discuss more on tracking SECS on the unreclaimable list only.
>
> Could we assume that when the OOM killer wants to pick a victim to serve
> the new enclave, there must be at least one other *active* enclave which
> still has its SECS page in EPC?
>
No, at a given instant when OOM happens, an "active" enclave's SECS may not
be in EPC, but lots of its VA pages may be.

OOM := "no reclaimable pages left in the cgroup to reclaim and total usage  
is still at/near limit".



> If yes, that enclave will be selected as victim.
>
> If not, then no other enclave will be selected as victim.  Instead, only  
> the new
> enclave which is requesting more EPC will be selected, because its SECS
> is on
> the unreclaimable list.
>

You can't assume the requesting enclave's SECS is in the unreclaimable list
either. Consider a request coming from a #PF, as in the scenario where we
fixed the SECS NULL pointer by reloading it.

> If somehow this is unacceptable, then we need to track VA pages too in
> order to kill other inactive enclaves?
>

If we knew for sure the SECS would always be in EPC, and thus tracked in the
unreclaimable list, then we probably could do it (see below).
I hope the reason given above is clear.

>> There were some discussion on paging out VAs without killing enclaves  
>> but
>> it'd be complicated and not implemented yet.
>
> No, we don't involve swapping VA pages now.  It's a separate topic.
>
I only mentioned it as a kind of constraint impacting the current design.

Another potential alternative: we don't reclaim SECS either until OOM, and
only track SECS pages for cgroups. But that would change current behavior,
and I'm not sure about other consequences, e.g., enclaves theoretically
can allocate pages (including VA pages) in different cgroups/processes, so
we may still end up tracking all VA pages for cgroups, or tracking the SECS
page in all cgroups in which the enclave allocated any pages. Let me know
your thoughts.

>>
>> BTW, I need to clarify tracking pages (which is done by the LRUs) vs usage
>> accounting (which is done by charge/uncharge to the misc controller). To
>> me, tracking is for reclaiming, not accounting. Also, vEPC pages are not
>> tracked at all, but they are accounted for.
>
> I'll review the rest of the patches.  Thanks.


Thank you!
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-10-05  4:22             ` Haitao Huang
@ 2023-10-05  6:49               ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-05  6:49 UTC (permalink / raw)
  To: Mehta, Sohil, linux-sgx, x86, dave.hansen, cgroups, hpa, mingo,
	tj, bp, haitao.huang, tglx, jarkko, linux-kernel
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Wed, 2023-10-04 at 23:22 -0500, Haitao Huang wrote:
> On Wed, 04 Oct 2023 16:13:41 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
> > > On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > > > > > 
> > > > > > Btw, probably a dumb question:
> > > > > > 
> > > > > > Theoretically if you only need to find a victim enclave you don't
> > > > > > need to put VA pages to the unreclaimable list, because those VA
> > > > > > pages will be freed anyway when enclave is killed.  So keeping VA
> > > > > > pages in the list is for accounting all the pages that the cgroup
> > > > > > is having?
> > > > > 
> > > > > Yes basically tracking them in cgroups as they are allocated.
> > > > > 
> > > > > VAs and SECS may also come and go as swapping/unswapping happens. But
> > > > > if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
> > > > > have to reclaim VAs/SECS in the same cgroup starting from the front of
> > > > > the LRU list. To reclaim a VA/SECS, it identifies the enclave from the
> > > > > owner of the VA/SECS page and kills it, as killing the enclave is the
> > > > > only way to reclaim VA/SECS pages.
> > > > 
> > > > To kill an enclave you just need to track SECS in the unreclaimable
> > > > list.  Only when you want to account the total EPC pages via some
> > > > list do you _probably_ need to track VA as well.  But I am not quite
> > > > sure about this either.
> > > 
> > > There is a case where even SECS is paged out for an enclave with all
> > > reclaimables out.
> > 
> > Yes.  But this essentially means these enclaves are not active, thus  
> > shouldn't
> > be the victim of OOM?
> > 
> 
> But there are VA pages for the enclave at that moment, so it can be a
> candidate OOM victim.

Yes.  I am not familiar with how the OOM code chooses a victim, but choosing
inactive enclaves seems more reasonable.


[...]

> > > There were some discussion on paging out VAs without killing enclaves  
> > > but
> > > it'd be complicated and not implemented yet.
> > 
> > No, we don't involve swapping VA pages now.  It's a separate topic.
> > 
> I only mentioned it as a kind of constraint impacting the current design.
> 
> Another potential alternative: we don't reclaim SECS either until OOM, and
> only track SECS pages for cgroups. But that would change current behavior,
> and I'm not sure about other consequences, e.g., enclaves theoretically
> can allocate pages (including VA pages) in different cgroups/processes, so
> we may still end up tracking all VA pages for cgroups, or tracking the SECS
> page in all cgroups in which the enclave allocated any pages. Let me know
> your thoughts.

Let's not change current behaviour.  I seriously doubt that is needed.

So it seems to me that what we need is just some way to let the OOM code find
some victim enclave.  I am not sure whether "tracking EPC pages in some lists"
has anything to do with cgroup accounting of EPC pages, so I will take a look
at the rest of the patches.



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-10-05 12:24   ` Huang, Kai
  2023-10-05 19:23     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-05 12:24 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Adjust and expose the top-level reclaim function as
> sgx_reclaim_epc_pages() for use by the upcoming EPC cgroup, which will
> initiate reclaim to enforce the max limit.
> 
> Make these adjustments to the function signature.
> 
> 1) To take a parameter that specifies the number of pages to scan for
> reclaiming. Define a max value of 32, but scan 16 in the case for the
> global reclaimer (ksgxd). The EPC cgroup will use it to specify a
> desired number of pages to be reclaimed up to the max value of 32.
> 
> 2) To take a flag to force reclaiming a page regardless of its age.  The
> EPC cgroup will use the flag to enforce its limits by draining the
> reclaimable lists before resorting to other measures, e.g. forcefully
> kill enclaves.
> 
> 3) Return the number of reclaimed pages. The EPC cgroup will use the
> result to track reclaiming progress and escalate to a more forceful
> reclaiming mode, e.g., calling this function with the flag to ignore age
> of pages.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V4:
> - Combined the 3 patches that made the individual changes to the
> function signature.
> - Removed 'high' limit in commit message.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 31 +++++++++++++++++++++----------
>  arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
>  2 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 3b875ab4dcd0..4e1a3e038db5 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -18,6 +18,11 @@
>  #include "encl.h"
>  #include "encls.h"
>  
> +/*
> + * Maximum number of pages to scan for reclaiming.
> + */
> +#define SGX_NR_TO_SCAN_MAX	32
> +
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
> @@ -279,7 +284,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  	mutex_unlock(&encl->lock);
>  }
>  
> -/*
> +/**
> + * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
> + * @nr_to_scan:		 Number of EPC pages to scan for reclaim
> + * @ignore_age:		 Reclaim a page even if it is young
> + *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
>   * been accessed since the last scan. Move those pages to the tail of active
> @@ -292,15 +301,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -static void sgx_reclaim_pages(void)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)

'size_t' looks odd.  Any reason to use it?

Given you only scan 32 at maximum, seems 'int' is good enough?

>  {
> -	struct sgx_backing backing[SGX_NR_TO_SCAN];
> +	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
>  	struct sgx_encl_page *encl_page;
>  	pgoff_t page_index;
>  	LIST_HEAD(iso);
> -	int ret;
> -	int i;
> +	size_t ret, i;
>  
>  	spin_lock(&sgx_global_lru.lock);
>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {

The function comment says 

	* @nr_to_scan:		 Number of EPC pages to scan for reclaim

But I don't see it even being used, if my eyes aren't deceiving me?
	
> @@ -326,13 +334,14 @@ static void sgx_reclaim_pages(void)
>  	spin_unlock(&sgx_global_lru.lock);
>  
>  	if (list_empty(&iso))
> -		return;
> +		return 0;
>  
>  	i = 0;
>  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->encl_page;
>  
> -		if (!sgx_reclaimer_age(epc_page))
> +		if (i == SGX_NR_TO_SCAN_MAX ||

i == nr_to_scan?

And should we have a

	if (nr_to_scan > SGX_NR_TO_SCAN_MAX)
		return 0;

at the very beginning of this function?

> +		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
>  			goto skip;
>  
>  		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
> @@ -371,6 +380,8 @@ static void sgx_reclaim_pages(void)
>  
>  		sgx_free_epc_page(epc_page);
>  	}
> +
> +	return i;
>  }
>  

I found this function a little bit odd, given the mixing of 'nr_to_scan',
SGX_NR_TO_SCAN and SGX_NR_TO_SCAN_MAX.

From the changelog:

	1) To take a parameter that specifies the number of pages to scan for
	reclaiming. Define a max value of 32, but scan 16 in the case for the
	global reclaimer (ksgxd). 

It appears we want to make this function scan @nr_to_scan for the cgroup, but
still want to scan a fixed value for ksgxd, which is SGX_NR_TO_SCAN.  And
@nr_to_scan can be larger than SGX_NR_TO_SCAN but smaller than
SGX_NR_TO_SCAN_MAX.

Putting aside the mystery of why the above is needed, to achieve it, is it
clearer if we do the below?

int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
{
	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
	...

	if (nr_to_scan > SGX_NR_TO_SCAN_MAX)
		return 0;

	for (i = 0; i < nr_to_scan; i++) {
		...
	}

	return reclaimed;
}

/* This is for ksgxd() */
int sgx_reclaim_epc_page(void)
{
	return __sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
}

EPC cgroup calls __sgx_reclaim_epc_pages() directly, or introduce another
wrapper.






^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-10-05 12:30   ` Huang, Kai
  2023-10-05 19:33     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-05 12:30 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
> +{
> +	return &sgx_global_lru;
> +}
> +
> +static inline bool sgx_can_reclaim(void)
> +{
> +	return !list_empty(&sgx_global_lru.reclaimable);
> +}
> +

Shouldn't sgx_can_reclaim() also take a 'struct sgx_epc_lru_lists *'?

I thought we also need to check whether a cgroup's LRU lists can be reclaimed?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2023-10-05 12:24   ` Huang, Kai
@ 2023-10-05 19:23     ` Haitao Huang
  2023-10-05 20:25       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-05 19:23 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Thu, 05 Oct 2023 07:24:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Adjust and expose the top-level reclaim function as
>> sgx_reclaim_epc_pages() for use by the upcoming EPC cgroup, which will
>> initiate reclaim to enforce the max limit.
>>
>> Make these adjustments to the function signature.
>>
>> 1) To take a parameter that specifies the number of pages to scan for
>> reclaiming. Define a max value of 32, but scan 16 in the case for the
>> global reclaimer (ksgxd). The EPC cgroup will use it to specify a
>> desired number of pages to be reclaimed up to the max value of 32.
>>
>> 2) To take a flag to force reclaiming a page regardless of its age.  The
>> EPC cgroup will use the flag to enforce its limits by draining the
>> reclaimable lists before resorting to other measures, e.g. forcefully
>> kill enclaves.
>>
>> 3) Return the number of reclaimed pages. The EPC cgroup will use the
>> result to track reclaiming progress and escalate to a more forceful
>> reclaiming mode, e.g., calling this function with the flag to ignore age
>> of pages.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> ---
>> V4:
>> - Combined the 3 patches that made the individual changes to the
>> function signature.
>> - Removed 'high' limit in commit message.
>> ---
>>  arch/x86/kernel/cpu/sgx/main.c | 31 +++++++++++++++++++++----------
>>  arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
>>  2 files changed, 22 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c  
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index 3b875ab4dcd0..4e1a3e038db5 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -18,6 +18,11 @@
>>  #include "encl.h"
>>  #include "encls.h"
>>
>> +/*
>> + * Maximum number of pages to scan for reclaiming.
>> + */
>> +#define SGX_NR_TO_SCAN_MAX	32
>> +
>>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>>  static int sgx_nr_epc_sections;
>>  static struct task_struct *ksgxd_tsk;
>> @@ -279,7 +284,11 @@ static void sgx_reclaimer_write(struct  
>> sgx_epc_page *epc_page,
>>  	mutex_unlock(&encl->lock);
>>  }
>>
>> -/*
>> +/**
>> + * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>> + * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>> + * @ignore_age:		 Reclaim a page even if it is young
>> + *
>>   * Take a fixed number of pages from the head of the active page pool  
>> and
>>   * reclaim them to the enclave's private shmem files. Skip the pages,  
>> which have
>>   * been accessed since the last scan. Move those pages to the tail of  
>> active
>> @@ -292,15 +301,14 @@ static void sgx_reclaimer_write(struct  
>> sgx_epc_page *epc_page,
>>   * problematic as it would increase the lock contention too much,  
>> which would
>>   * halt forward progress.
>>   */
>> -static void sgx_reclaim_pages(void)
>> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>
> 'size_t' looks odd.  Any reason to use it?
>
> Given you only scan 32 at maximum, seems 'int' is good enough?
>

It was initially int. Jarkko suggested ssize_t. I changed it to size_t as
this function will never return a negative value.

>>  {
>> -	struct sgx_backing backing[SGX_NR_TO_SCAN];
>> +	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>>  	struct sgx_epc_page *epc_page, *tmp;
>>  	struct sgx_encl_page *encl_page;
>>  	pgoff_t page_index;
>>  	LIST_HEAD(iso);
>> -	int ret;
>> -	int i;
>> +	size_t ret, i;
>>
>>  	spin_lock(&sgx_global_lru.lock);
>>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
>
This should be nr_to_scan.
It was missed during some rebase and reordering operations.

> The function comment says
>
> 	* @nr_to_scan:		 Number of EPC pages to scan for reclaim
>
> But I don't see it even being used, if my eyes aren't deceiving me?
> 	
>> @@ -326,13 +334,14 @@ static void sgx_reclaim_pages(void)
>>  	spin_unlock(&sgx_global_lru.lock);
>>
>>  	if (list_empty(&iso))
>> -		return;
>> +		return 0;
>>
>>  	i = 0;
>>  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>>  		encl_page = epc_page->encl_page;
>>
>> -		if (!sgx_reclaimer_age(epc_page))
>> +		if (i == SGX_NR_TO_SCAN_MAX ||
>
> i == nr_to_scan?
>
Not needed if the for statement above is fixed to use nr_to_scan.
Anything above MAX will be skipped and put back to the LRU.

> And should we have a
>
> 	if (nr_to_scan > SGX_NR_TO_SCAN_MAX)
> 		return 0;
>
> at the very beginning of this function?
>

In the final version, the caller is to make sure not to call with nr_to_scan
larger than SGX_NR_TO_SCAN_MAX.

>> +		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
>>  			goto skip;
>>
>>  		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
>> @@ -371,6 +380,8 @@ static void sgx_reclaim_pages(void)
>>
>>  		sgx_free_epc_page(epc_page);
>>  	}
>> +
>> +	return i;
>>  }
>>
>
> I found this function a little bit odd, given the mixing of 'nr_to_scan',
> SGX_NR_TO_SCAN and SGX_NR_TO_SCAN_MAX.
>
> From the changelog:
>
> 	1) To take a parameter that specifies the number of pages to scan for
> 	reclaiming. Define a max value of 32, but scan 16 in the case for the
> 	global reclaimer (ksgxd).
>
> It appears we want to make this function scan @nr_to_scan for the cgroup,
> but still want to scan a fixed value for ksgxd, which is SGX_NR_TO_SCAN.
> And @nr_to_scan can be larger than SGX_NR_TO_SCAN but smaller than
> SGX_NR_TO_SCAN_MAX.
>
> Putting aside the mystery of why the above is needed, to achieve it, is it
> clearer if we do the below?
>
> int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
> {
> 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
> 	...
>
> 	if (nr_to_scan > SGX_NR_TO_SCAN_MAX)
> 		return 0;

We could set nr_to_scan to MAX, but since this is code internal to the
driver, maybe just make sure callers don't call with bigger numbers.

>
> 	for (i = 0; i < nr_to_scan; i++) {
> 		...
> 	}
>

yes

> 	return reclaimed;
> }
>
> /* This is for ksgxd() */
> int sgx_reclaim_epc_page(void)
> {
> 	return __sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> }

Some maintainers may prefer no wrapping.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs
  2023-10-05 12:30   ` Huang, Kai
@ 2023-10-05 19:33     ` Haitao Huang
  2023-10-05 20:38       ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-05 19:33 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Thu, 05 Oct 2023 07:30:46 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct  
>> sgx_epc_page *epc_page)
>> +{
>> +	return &sgx_global_lru;
>> +}
>> +
>> +static inline bool sgx_can_reclaim(void)
>> +{
>> +	return !list_empty(&sgx_global_lru.reclaimable);
>> +}
>> +
>
> Shouldn't sgx_can_reclaim() also take a 'struct sgx_epc_lru_lists *'?
>
> I thought we also need to check whether a cgroup's LRU lists can be  
> reclaimed?

This is only used to check if any pages are reclaimable at the top level in
this file. Later, sgx_epc_cgroup_lru_empty(NULL) is used in this function
to recursively check all cgroups starting from the root.
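
For reference, this is roughly what sgx_can_reclaim() becomes in patch 16
(a sketch taken from the hunk quoted later in this thread, not new
behavior):

static inline bool sgx_can_reclaim(void)
{
	/* with cgroups enabled, walk all cgroup LRUs from the root */
	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
		return !sgx_epc_cgroup_lru_empty(NULL);

	return !list_empty(&sgx_global_lru.reclaimable);
}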

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2023-10-05 19:23     ` Haitao Huang
@ 2023-10-05 20:25       ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-05 20:25 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo


> > > 
> > > -/*
> > > +/**
> > > + * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
> > > + * @nr_to_scan:		 Number of EPC pages to scan for reclaim
> > > + * @ignore_age:		 Reclaim a page even if it is young
> > > + *
> > >   * Take a fixed number of pages from the head of the active page pool  
> > > and
> > >   * reclaim them to the enclave's private shmem files. Skip the pages,  
> > > which have
> > >   * been accessed since the last scan. Move those pages to the tail of  
> > > active
> > > @@ -292,15 +301,14 @@ static void sgx_reclaimer_write(struct  
> > > sgx_epc_page *epc_page,
> > >   * problematic as it would increase the lock contention too much,  
> > > which would
> > >   * halt forward progress.
> > >   */
> > > -static void sgx_reclaim_pages(void)
> > > +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> > 
> > 'size_t' looks odd.  Any reason to use it?
> > 
> > Given you only scan 32 at maximum, seems 'int' is good enough?
> > 
> 
> It was initially int. Jarkko suggested ssize_t. I changed it to size_t as
> this function will never return a negative value.

Then 'unsigned int'.  We are talking about 32 at max here.

size_t is more suitable for bytes, but we are dealing with a number of pages.

Maybe Jarkko could comment why size_t is better.

[...]

> > 
> > >  	i = 0;
> > >  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> > >  		encl_page = epc_page->encl_page;
> > > 
> > > -		if (!sgx_reclaimer_age(epc_page))
> > > +		if (i == SGX_NR_TO_SCAN_MAX ||
> > 
> > i == nr_to_scan?
> > 
> Not needed if the for statement above is fixed to use nr_to_scan.
> Anything above MAX will be skipped and put back to the LRU.

I believe using nr_to_scan is more logically correct.

[...]


> > 
> > I found this function a little bit odd, given the mixing of 'nr_to_scan',
> > SGX_NR_TO_SCAN and SGX_NR_TO_SCAN_MAX.
> > 
> > From the changelog:
> > 
> > 	1) To take a parameter that specifies the number of pages to scan for
> > 	reclaiming. Define a max value of 32, but scan 16 in the case for the
> > 	global reclaimer (ksgxd).
> > 
> > It appears we want to make this function scan @nr_to_scan for the cgroup,
> > but still want to scan a fixed value for ksgxd, which is SGX_NR_TO_SCAN.
> > And @nr_to_scan can be larger than SGX_NR_TO_SCAN but smaller than
> > SGX_NR_TO_SCAN_MAX.
> > 
> > Putting aside the mystery of why the above is needed, to achieve it, is
> > it clearer if we do the below?
> > 
> > int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
> > {
> > 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
> > 	...
> > 
> > 	if (nr_to_scan > SGX_NR_TO_SCAN_MAX)
> > 		return 0;
> 
> We could set nr_to_scan to MAX, but since this is code internal to the
> driver, maybe just make sure callers don't call with bigger numbers.

Please add this check, using WARN_ON_ONCE() if it's better.

Then the code is much easier to review.
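
Something like this at the very beginning of the function would do (just a
sketch of the suggested guard):

	/* callers must not ask for more than the backing[] array holds */
	if (WARN_ON_ONCE(nr_to_scan > SGX_NR_TO_SCAN_MAX))
		return 0;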

> 
> > 
> > 	for (i = 0; i < nr_to_scan; i++) {
> > 		...
> > 	}
> > 
> 
> yes

please fix this up, then ...

> 
> > 	return reclaimed;
> > }
> > 
> > /* This is for ksgxd() */
> > int sgx_reclaim_epc_page(void)
> > {
> > 	return __sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> > }
> 
> Some maintainers may prefer no wrapping.
> 

... OK.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs
  2023-10-05 19:33     ` Haitao Huang
@ 2023-10-05 20:38       ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-05 20:38 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Thu, 2023-10-05 at 14:33 -0500, Haitao Huang wrote:
> On Thu, 05 Oct 2023 07:30:46 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct  
> > > sgx_epc_page *epc_page)
> > > +{
> > > +	return &sgx_global_lru;
> > > +}
> > > +
> > > +static inline bool sgx_can_reclaim(void)
> > > +{
> > > +	return !list_empty(&sgx_global_lru.reclaimable);
> > > +}
> > > +
> > 
> > Shouldn't sgx_can_reclaim() also take a 'struct sgx_epc_lru_lists *'?
> > 
> > I thought we also need to check whether a cgroup's LRU lists can be  
> > reclaimed?
> 
> This is only used to check if any pages are reclaimable at the top level
> in this file. Later, sgx_epc_cgroup_lru_empty(NULL) is used in this
> function to recursively check all cgroups starting from the root.
> 
> 

This again falls into the "impossible to review unless you review a later
patch first" category.  This patch says nothing about sgx_can_reclaim() only
being used at the top level.  Even if it does, why can't it take LRU lists as
input?

All this patch says is we need to prepare these functions to suit multiple LRU
lists.

Btw, why sgx_reclaim_epc_pages() doesn't take LRU lists as input either?  Is it
possible that it can be called across multiple LRU lists, or across different
lists in one LRU?

Why do we need to find some particular LRU lists by given EPC page?

+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page
*epc_page)
+{
+	return &sgx_global_lru;
+}
+

Maybe it's clear to other people, but to me it sounds like some necessary
design background is missing, at least.

Please try your best to make the patch self-reviewable by justifying all of
those.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
  (?)
  (?)
@ 2023-10-05 21:01   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-05 21:01 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> ---
>  arch/x86/Kconfig                     |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
>  arch/x86/kernel/cpu/sgx/main.c       |  68 ++++-
>  arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

Given how large this patch is, it's better to split it if we can.

It seems we can at least split ...

[...]

> 
> @@ -970,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -1020,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
>  
>  		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>  
>  		sgx_nr_epc_sections++;
>  	}
> @@ -1029,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>  
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
> +
>  	return true;
>  }
> 

... splitting the capacity setup out as a separate patch, as "capacity" is a
top-level-only file showing the maximum instances of the resource.

I'll review rest later.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-09-23  3:06   ` Haitao Huang
  (?)
@ 2023-10-09 23:45   ` Huang, Kai
  2023-10-10  0:23     ` Sean Christopherson
  2023-10-10  1:04     ` Haitao Huang
  -1 siblings, 2 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-09 23:45 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Introduce the OOM path for killing an enclave with a reclaimer that is no
> longer able to reclaim enough EPC pages. Find a victim enclave, which
> will be an enclave with only "unreclaimable" EPC pages left in the
> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> and zap the enclave's entire page range, and drain all mm references in
> encl->mm_list. Block allocating any EPC pages in #PF handler, or
> reloading any pages in all paths, or creating any new mappings.
> 
> The OOM killing path may race with the reclaimers: in some cases, the
> victim enclave is in the process of reclaiming the last EPC pages when
> OOM happens, that is, all pages other than SECS and VA pages are in
> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> the enclave backing, VA pages as well as SECS. So the OOM killer does
> not directly release those enclave resources, instead, it lets all
> reclaiming in progress to finish, and relies (as currently done) on
> kref_put on encl->refcount to trigger sgx_encl_release() to do the
> final cleanup.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V5:
> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> 
> V4:
> - Updates for patch reordering and typo fixes.
> 
> V3:
> - Rebased to use the new VMA_ITERATOR to zap VMAs.
> - Fixed the racing cases by blocking new page allocation/mapping and
> reloading when enclave is marked for OOM. And do not release any enclave
> resources other than draining mm_list entries, and let pages in
> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
> 

[...]

> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru:	LRU that is low
> + *
> + * Return:	%true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> +	struct sgx_epc_page *victim;
> +
> +	spin_lock(&lru->lock);
> +	victim = sgx_oom_get_victim(lru);
> +	spin_unlock(&lru->lock);
> +
> +	if (!victim)
> +		return false;
> +
> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> +		return sgx_oom_encl_page(victim->encl_page);
> +
> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> +		return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why we need
to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?

So by looking at the patch ("x86/sgx: Limit process EPC usage with misc cgroup
controller"), if I am not missing anything, the whole "unreclaimable" list is
just used to find the victim enclave when OOM needs to be done.  Thus, I don't
see why an "enclave_list" cannot be used to achieve this.

The reason that I am asking is because it seems using "enclave_list" we can
simplify the code.  At least the patches related to track VA/SECS pages, and the
SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated completely.  

Using "enclave_list", I guess you just need to put the enclave to the current
EPC cgroup when SECS page is allocated.

In fact, putting "unreclaimable" list to LRU itself is a little bit confusing
because: 1) you cannot really reclaim anything from the list; 2) VA/SECS pages
don't have the concept of "young" at all, thus makes no sense to annotate they
as LRU.

Thus putting VA/SECS pages on the "unreclaimable" list, instead of keeping an
"enclave_list", seems to have no benefit and will only make the code more
complicated.

Or am I missing anything?
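
Just to illustrate the alternative (a rough sketch only -- epc_cg->lock,
epc_cg->encl_list and encl->encl_list are all hypothetical members, not
taken from the patches):

static struct sgx_encl *sgx_oom_get_victim_encl(struct sgx_epc_cgroup *epc_cg)
{
	struct sgx_encl *encl;

	spin_lock(&epc_cg->lock);
	/* pick the first (e.g. oldest) enclave in the cgroup */
	encl = list_first_entry_or_null(&epc_cg->encl_list,
					struct sgx_encl, encl_list);
	if (encl)
		kref_get(&encl->refcount);
	spin_unlock(&epc_cg->lock);

	return encl;
}

sgx_epc_oom() would then just kill the returned enclave, with no need for
the SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL flags.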

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (2 preceding siblings ...)
  (?)
@ 2023-10-10  0:12   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  0:12 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> +/**
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its lrus
> + * @root:	root of the tree to check
> + *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensure css_root ref acquired
> +	 */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> 

[...]

> 
>  static inline bool sgx_can_reclaim(void)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return !sgx_epc_cgroup_lru_empty(NULL);
> +

Is it better to keep a root sgx_epc_cgroup and pass the root instead of NULL?

>  	return !list_empty(&sgx_global_lru.reclaimable);
>  }
>  

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (3 preceding siblings ...)
  (?)
@ 2023-10-10  0:16   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  0:16 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (epc_cg)
> +		return &epc_cg->lru;
> +	return NULL;
> +}
> 

It's legal to return a NULL EPC cgroup for a given EPC page, i.e., when the
enclave isn't assigned to any cgroup.  But ...

>  
>  static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return epc_cg_lru(epc_page->epc_cg);
> +
>  	return &sgx_global_lru;
>  }

... here is it legal to return a NULL LRU list?

It appears you always want to return a valid LRU list.  That is, if the EPC
cgroup is enabled but the EPC page doesn't belong to any cgroup, then you
want to return &sgx_global_lru?
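
I.e., something like below (a sketch of what I would expect, based on the
hunk above):

static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
{
	/* fall back to the global LRU when the page has no cgroup */
	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC) && epc_page->epc_cg)
		return &epc_page->epc_cg->lru;

	return &sgx_global_lru;
}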


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-09 23:45   ` Huang, Kai
@ 2023-10-10  0:23     ` Sean Christopherson
  2023-10-10  0:50       ` Huang, Kai
  2023-10-10  1:42       ` Haitao Huang
  2023-10-10  1:04     ` Haitao Huang
  1 sibling, 2 replies; 144+ messages in thread
From: Sean Christopherson @ 2023-10-10  0:23 UTC (permalink / raw)
  To: Kai Huang
  Cc: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Sohil Mehta, tj, mingo, kristen,
	yangjie, Zhiquan1 Li, mikko.ylinen, Bo Zhang, anakrish

On Mon, Oct 09, 2023, Kai Huang wrote:
> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > +/**
> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > + * @lru:	LRU that is low
> > + *
> > + * Return:	%true if a victim was found and kicked.
> > + */
> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > +{
> > +	struct sgx_epc_page *victim;
> > +
> > +	spin_lock(&lru->lock);
> > +	victim = sgx_oom_get_victim(lru);
> > +	spin_unlock(&lru->lock);
> > +
> > +	if (!victim)
> > +		return false;
> > +
> > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > +		return sgx_oom_encl_page(victim->encl_page);
> > +
> > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > +		return sgx_oom_encl(victim->encl);
> 
> I hate to bring this up, at least at this stage, but I am wondering why we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?

The motivation for tracking EPC pages instead of enclaves was so that the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual EPC code
didn't actually kill the VM process, it instead just freed all of the EPC pages
and abused the SGX architecture to effectively make the guest recreate all its
enclaves (IIRC, QEMU does the same thing to "support" live migration).

Looks like y'all punted on that with:

  The EPC pages allocated for KVM guests by the virtual EPC driver are not
  reclaimable by the host kernel [5]. Therefore they are not tracked by any
  LRU lists for reclaiming purposes in this implementation, but they are
  charged toward the cgroup of the user process (e.g., QEMU) launching the
  guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
  driver will stop allocating more EPC for the VM, and return SIGBUS to the
  user process which would abort the VM launch.

which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
enforcement.  E.g. if the hard limit for an EPC cgroup is lowered, userspace running
enclaves in a VM could continue on and refuse to give up its EPC, and thus run above
its limit in perpetuity.

I can see userspace wanting to explicitly terminate the VM instead of "silently"
killing the VM's enclaves, but that seems like it should be a knob in the virtual EPC
code.
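
E.g. something like this in the virtual EPC driver (completely made up, just
to illustrate the idea of such a knob):

/* hypothetical knob, not in any patch: fail vEPC allocations so userspace
 * terminates the VM, instead of silently zapping its enclaves' EPC */
static bool vepc_oom_kill_vm;
module_param(vepc_oom_kill_vm, bool, 0644);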

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (4 preceding siblings ...)
  (?)
@ 2023-10-10  0:26   ` Huang, Kai
  2023-10-22 18:26     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  0:26 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> @@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
> + * @epc_cg:		 EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
> @@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg)
>  {
>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
> @@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	LIST_HEAD(iso);
>  	size_t ret, i;
>  
> -	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
> +	/*
> +	 * If a specific cgroup is not being targeted, take from the global
> +	 * list first, even when cgroups are enabled.  If there are
> +	 * pages on the global LRU then they should get reclaimed asap.
> +	 */
> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> +		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);

(I wish such code could somehow be moved to the earlier patches, so that we
could get an early idea of how sgx_reclaim_epc_pages() is supposed to be
used.)

So here when we are not targeting a specific EPC cgroup, we always reclaim from
the global list first, ...

[...]

>  
>  	if (list_empty(&iso))
>  		return 0;
> @@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and we always try to reclaim the global list first when direct reclaim is
desired, even if the enclave is within some EPC cgroup.  ...

>  }
>  
>  static int ksgxd(void *p)
> @@ -446,7 +460,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and in ksgxd() as well, which I guess is somehow acceptable.  ...

>  
>  		cond_resched();
>  	}
> @@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>  
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (!sgx_can_reclaim())
> -			return ERR_PTR(-ENOMEM);
> +		if (!sgx_can_reclaim()) {
> +			page = ERR_PTR(-ENOMEM);
> +			break;
> +		}
>  
>  		if (!reclaim) {
>  			page = ERR_PTR(-EBUSY);
> @@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and when an EPC page is allocated, no matter whether the EPC page belongs to
any cgroup or not.

When we are allocating an EPC page for one enclave, if that enclave belongs
to some cgroup, is it more reasonable to reclaim EPC pages from its own group
(and the children under it)?

You already got the current EPC cgroup at the beginning of sgx_alloc_epc_page()
when you want to charge the EPC allocation.
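
I.e., something like below in sgx_alloc_epc_page() (sketch only, reusing the
epc_cg already returned by sgx_epc_cgroup_try_charge() above):

		/* reclaim from the allocating task's own cgroup (and its
		 * descendants) first, instead of passing NULL */
		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, epc_cg);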

>  		cond_resched();
>  	}
>  

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  0:23     ` Sean Christopherson
@ 2023-10-10  0:50       ` Huang, Kai
  2023-10-10  1:34         ` Huang, Kai
  2023-10-10  1:42       ` Haitao Huang
  1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  0:50 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, Li, Zhiquan1,
	dave.hansen, haitao.huang, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, Mehta, Sohil, mikko.ylinen, bp, x86,
	kristen

On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru:	LRU that is low
> > > + *
> > > + * Return:	%true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> 
> The motivation for tracking EPC pages instead of enclaves was so that the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  
> 

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the EPC pages
> and abused the SGX architecture to effectively make the guest recreate all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires that enough EPC can
be allocated on the destination machine to work, AFAICT.
 
> 
> Looks like y'all punted on that with:
> 
>   The EPC pages allocated for KVM guests by the virtual EPC driver are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user process (e.g., QEMU) launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
> 
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> 

"enforced" do you mean?

Currently sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate an
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
And Qemu then kills the VM with some nonsense message:

        error: kvm run failed Bad address
        <dump guest registers nonsense>

> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> enforcement.  E.g. if the hard limit for an EPC cgroup is lowered, userspace running
> enclaves in a VM could continue on and refuse to give up its EPC, and thus run above
> its limit in perpetuity.

Agreed.  But it looks like this cannot be resolved until we can reclaim EPC
pages from a VM.

Or in the EPC cgroup code we can refuse to set a maximum which cannot be
supported, e.g., anything less than the total virtual EPC size.

I guess the second is acceptable for now?

> 
> I can see userspace wanting to explicitly terminate the VM instead of "silently"
> killing the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.
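
E.g., something like below when the max is written (sketch only -- 'new_max'
and 'vepc_usage', a byte count of EPC currently handed out to guests, are
both hypothetical):

	/* refuse a limit that can never be enforced, since vEPC pages
	 * cannot be reclaimed from the host */
	if (new_max < atomic64_read(&vepc_usage))
		return -EINVAL;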


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-09 23:45   ` Huang, Kai
  2023-10-10  0:23     ` Sean Christopherson
@ 2023-10-10  1:04     ` Haitao Huang
  2023-10-10  1:18       ` Huang, Kai
  1 sibling, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-10  1:04 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Introduce the OOM path for killing an enclave with a reclaimer that is  
>> no
>> longer able to reclaim enough EPC pages. Find a victim enclave, which
>> will be an enclave with only "unreclaimable" EPC pages left in the
>> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
>> and zap the enclave's entire page range, and drain all mm references in
>> encl->mm_list. Block allocating any EPC pages in #PF handler, or
>> reloading any pages in all paths, or creating any new mappings.
>>
>> The OOM killing path may race with the reclaimers: in some cases, the
>> victim enclave is in the process of reclaiming the last EPC pages when
>> OOM happens, that is, all pages other than SECS and VA pages are in
>> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
>> the enclave backing, VA pages as well as SECS. So the OOM killer does
>> not directly release those enclave resources, instead, it lets all
>> reclaiming in progress to finish, and relies (as currently done) on
>> kref_put on encl->refcount to trigger sgx_encl_release() to do the
>> final cleanup.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> ---
>> V5:
>> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
>>
>> V4:
>> - Updates for patch reordering and typo fixes.
>>
>> V3:
>> - Rebased to use the new VMA_ITERATOR to zap VMAs.
>> - Fixed the racing cases by blocking new page allocation/mapping and
>> reloading when enclave is marked for OOM. And do not release any enclave
>> resources other than draining mm_list entries, and let pages in
>> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
>> - Due to above changes, also removed the no-longer needed encl->lock in
>> the OOM path which was causing deadlocks reported by the lock prover.
>>
>
> [...]
>
>> +
>> +/**
>> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> + * @lru:	LRU that is low
>> + *
>> + * Return:	%true if a victim was found and kicked.
>> + */
>> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> +{
>> +	struct sgx_epc_page *victim;
>> +
>> +	spin_lock(&lru->lock);
>> +	victim = sgx_oom_get_victim(lru);
>> +	spin_unlock(&lru->lock);
>> +
>> +	if (!victim)
>> +		return false;
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> +		return sgx_oom_encl_page(victim->encl_page);
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> +		return sgx_oom_encl(victim->encl);
>
> I hate to bring this up, at least at this stage, but I am wondering why  
> we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?
>
> So by looking at the patch ("x86/sgx: Limit process EPC usage with misc
> cgroup controller"), if I am not missing anything, the whole
> "unreclaimable" list is just used to find the victim enclave when OOM
> needs to be done.  Thus, I don't see why an "enclave_list" cannot be used
> to achieve this.
>
> The reason that I am asking is because it seems using "enclave_list" we  
> can
> simplify the code.  At least the patches related to track VA/SECS pages,  
> and the
> SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> completely.  
> Using "enclave_list", I guess you just need to put the enclave to the  
> current
> EPC cgroup when SECS page is allocated.
>
Later the hosting process could be migrated/reassigned to another cgroup.
What to do when the new cgroup is OOM?

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  1:04     ` Haitao Huang
@ 2023-10-10  1:18       ` Huang, Kai
  2023-10-10  1:38         ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  1:18 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, haitao.huang, tglx, tj, Mehta, Sohil
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > 
> > > Introduce the OOM path for killing an enclave with a reclaimer that is  
> > > no
> > > longer able to reclaim enough EPC pages. Find a victim enclave, which
> > > will be an enclave with only "unreclaimable" EPC pages left in the
> > > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> > > and zap the enclave's entire page range, and drain all mm references in
> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
> > > reloading any pages in all paths, or creating any new mappings.
> > > 
> > > The OOM killing path may race with the reclaimers: in some cases, the
> > > victim enclave is in the process of reclaiming the last EPC pages when
> > > OOM happens, that is, all pages other than SECS and VA pages are in
> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> > > the enclave backing, VA pages as well as SECS. So the OOM killer does
> > > not directly release those enclave resources, instead, it lets all
> > > reclaiming in progress to finish, and relies (as currently done) on
> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
> > > final cleanup.
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Cc: Sean Christopherson <seanjc@google.com>
> > > ---
> > > V5:
> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> > > 
> > > V4:
> > > - Updates for patch reordering and typo fixes.
> > > 
> > > V3:
> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
> > > - Fixed the racing cases by blocking new page allocation/mapping and
> > > reloading when enclave is marked for OOM. And do not release any enclave
> > > resources other than draining mm_list entries, and let pages in
> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> > > - Due to above changes, also removed the no-longer needed encl->lock in
> > > the OOM path which was causing deadlocks reported by the lock prover.
> > > 
> > 
> > [...]
> > 
> > > +
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru:	LRU that is low
> > > + *
> > > + * Return:	%true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why  
> > we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> > 
> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
> > cgroup
> > controller"), if I am not missing anything, the whole "unreclaimable"  
> > list is
> > just used to find the victim enclave when OOM needs to be done.  Thus, I  
> > don't
> > see why "enclave_list" cannot be used to achieve this.
> > 
> > The reason that I am asking is because it seems using "enclave_list" we  
> > can
> > simplify the code.  At least the patches related to track VA/SECS pages,  
> > and the
> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> > completely.  
> > Using "enclave_list", I guess you just need to put the enclave to the  
> > current
> > EPC cgroup when SECS page is allocated.
> > 
> Later the hosting process could be migrated/reassigned to another cgroup?
> What to do when the new cgroup is OOM?
> 

You addressed this in the documentation, no?

+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
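
For reference, the pairing that makes this work is that each EPC page
remembers the cgroup it was charged to, so the uncharge at free time always
hits the original cgroup no matter where the owning process has migrated in
the meantime.  A condensed sketch of patch 16's flow (field and helper names
are abbreviated here for illustration; this is not the literal patch code):

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_cgroup *epc_cg;
	struct sgx_epc_page *page;

	/* Charges the *current* task's cgroup at allocation time. */
	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
	if (IS_ERR(epc_cg))
		return ERR_CAST(epc_cg);

	page = __sgx_alloc_epc_page();
	if (IS_ERR(page)) {
		sgx_epc_cgroup_uncharge(epc_cg);
		return page;
	}

	/* Remember the charged cgroup for the lifetime of the page. */
	page->epc_cg = epc_cg;
	return page;
}

void sgx_free_epc_page(struct sgx_epc_page *page)
{
	/* Uncharge the original cgroup, not the current task's. */
	if (page->epc_cg) {
		sgx_epc_cgroup_uncharge(page->epc_cg);
		page->epc_cg = NULL;
	}
	__sgx_free_epc_page(page);	/* return the page to the free pool */
}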

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  0:50       ` Huang, Kai
@ 2023-10-10  1:34         ` Huang, Kai
  2023-10-10 16:49           ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  1:34 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, haitao.huang, linux-kernel, mingo, tglx, tj, anakrish,
	jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  
> > 
> 
> Ah this seems a fair argument. :-)
> 
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC pages
> > and abused the SGX architecture to effectively make the guest recreate all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> 
> It returns SIGBUS.  SGX VM live migration also requires enough EPC to be
> allocatable on the destination machine to work, AFAICT.
>  
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user process (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> > 
> 
> "enforced" do you mean?
> 
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
> And Qemu then kills the VM with some nonsense message:
> 
>         error: kvm run failed Bad address
>         <dump guest registers nonsense>
> 
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> > enclaves in a VM could continue on and refuse to give up its EPC, and thus run above
> > its limit in perpetuity.
> 
> > 
> > I can see userspace wanting to explicitly terminate the VM instead of "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill the VM when the limit is set lower than the virtual
EPC size.

This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual EPC too?

In sgx_vepc_fault(), we check this flag early and return SIGBUS if
it is set.

But this also requires keeping virtual EPC pages in some list, and handling
them in sgx_epc_oom() too.

And for virtual EPC pages, I guess the "young" logic can be applied, thus
it's probably better to keep the actual virtual EPC pages on a (separate?)
list instead of keeping the virtual EPC instance.

	struct sgx_epc_lru {
		struct list_head reclaimable;
		struct sgx_encl *enclaves;
		struct list_head vepc_pages;
	}

Or still tracking VA/SECS and virtual EPC pages in a single unreclaimable list?

I don't know :-)
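
Something like the below, purely illustrative (the list layout and all
helper names here are made up):

struct sgx_epc_lru {
	spinlock_t lock;
	struct list_head reclaimable;	/* regular enclave pages */
	struct list_head enclaves;	/* enclaves, for OOM victim selection */
	struct list_head vepc_pages;	/* guest EPC, unreclaimable by host */
};

static bool sgx_epc_oom(struct sgx_epc_lru *lru)
{
	struct sgx_epc_page *vepc_page = NULL;
	struct sgx_encl *encl;

	spin_lock(&lru->lock);
	/* Prefer killing a host enclave before disturbing a guest. */
	encl = list_first_entry_or_null(&lru->enclaves, struct sgx_encl,
					lru_link);	/* hypothetical member */
	if (!encl)
		vepc_page = list_first_entry_or_null(&lru->vepc_pages,
						     struct sgx_epc_page, list);
	spin_unlock(&lru->lock);

	if (encl)
		return sgx_oom_encl(encl);
	if (vepc_page)
		return sgx_oom_vepc_page(vepc_page);	/* hypothetical */

	return false;
}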

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  1:18       ` Huang, Kai
@ 2023-10-10  1:38         ` Haitao Huang
  2023-10-10  2:12           ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-10  1:38 UTC (permalink / raw)
  To: mingo, linux-sgx, x86, dave.hansen, cgroups, hpa, linux-kernel,
	jarkko, bp, tglx, tj, Mehta, Sohil, Huang, Kai
  Cc: kristen, anakrish, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, yangjie, Zhang, Bo

On Mon, 09 Oct 2023 20:18:00 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
>> > >
>> > > Introduce the OOM path for killing an enclave with a reclaimer that  
>> is
>> > > no
>> > > longer able to reclaim enough EPC pages. Find a victim enclave,  
>> which
>> > > will be an enclave with only "unreclaimable" EPC pages left in the
>> > > cgroup LRU lists. Once a victim is identified, mark the enclave as  
>> OOM
>> > > and zap the enclave's entire page range, and drain all mm  
>> references in
>> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
>> > > reloading any pages in all paths, or creating any new mappings.
>> > >
>> > > The OOM killing path may race with the reclaimers: in some cases,  
>> the
>> > > victim enclave is in the process of reclaiming the last EPC pages  
>> when
>> > > OOM happens, that is, all pages other than SECS and VA pages are in
>> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires  
>> access to
>> > > the enclave backing, VA pages as well as SECS. So the OOM killer  
>> does
>> > > not directly release those enclave resources, instead, it lets all
>> > > reclaiming in progress to finish, and relies (as currently done) on
>> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
>> > > final cleanup.
>> > >
>> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Cc: Sean Christopherson <seanjc@google.com>
>> > > ---
>> > > V5:
>> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
>> > >
>> > > V4:
>> > > - Updates for patch reordering and typo fixes.
>> > >
>> > > V3:
>> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
>> > > - Fixed the racing cases by blocking new page allocation/mapping and
>> > > reloading when enclave is marked for OOM. And do not release any  
>> enclave
>> > > resources other than draining mm_list entries, and let pages in
>> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
>> > > - Due to above changes, also removed the no-longer needed  
>> encl->lock in
>> > > the OOM path which was causing deadlocks reported by the lock  
>> prover.
>> > >
>> >
>> > [...]
>> >
>> > > +
>> > > +/**
>> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> > > + * @lru:	LRU that is low
>> > > + *
>> > > + * Return:	%true if a victim was found and kicked.
>> > > + */
>> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > +{
>> > > +	struct sgx_epc_page *victim;
>> > > +
>> > > +	spin_lock(&lru->lock);
>> > > +	victim = sgx_oom_get_victim(lru);
>> > > +	spin_unlock(&lru->lock);
>> > > +
>> > > +	if (!victim)
>> > > +		return false;
>> > > +
>> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > +
>> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > +		return sgx_oom_encl(victim->encl);
>> >
>> > I hate to bring this up, at least at this stage, but I am wondering  
>> why
>> > we need
>> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
>> > "enclave_list" instead?
>> >
>> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc
>> > cgroup
>> > controller"), if I am not missing anything, the whole "unreclaimable"
>> > list is
>> > just used to find the victim enclave when OOM needs to be done.   
>> Thus, I
>> > don't
>> > see why "enclave_list" cannot be used to achieve this.
>> >
>> > The reason that I am asking is because it seems using "enclave_list"  
>> we
>> > can
>> > simplify the code.  At least the patches related to track VA/SECS  
>> pages,
>> > and the
>> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated
>> > completely.
>> > Using "enclave_list", I guess you just need to put the enclave to the
>> > current
>> > EPC cgroup when SECS page is allocated.
>> >
>> Later the hosting process could be migrated/reassigned to another cgroup?
>> What to do when the new cgroup is OOM?
>>
>
> You addressed this in the documentation, no?
>
> +Migration
> +---------
> +
> +Once an EPC page is charged to a cgroup (during allocation), it
> +remains charged to the original cgroup until the page is released
> +or reclaimed.  Migrating a process to a different cgroup doesn't
> +move the EPC charges that it incurred while in the previous cgroup
> +to its new cgroup.

Should we kill the enclave though because some VA pages may be in the new  
group?

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  0:23     ` Sean Christopherson
  2023-10-10  0:50       ` Huang, Kai
@ 2023-10-10  1:42       ` Haitao Huang
  2023-10-10  2:23         ` Huang, Kai
  1 sibling, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-10  1:42 UTC (permalink / raw)
  To: Kai Huang, Sean Christopherson
  Cc: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Sohil Mehta, tj, mingo, kristen, yangjie,
	Zhiquan1 Li, mikko.ylinen, Bo Zhang, anakrish

Hi Sean

On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Mon, Oct 09, 2023, Kai Huang wrote:
>> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > +/**
>> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> > + * @lru:	LRU that is low
>> > + *
>> > + * Return:	%true if a victim was found and kicked.
>> > + */
>> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > +{
>> > +	struct sgx_epc_page *victim;
>> > +
>> > +	spin_lock(&lru->lock);
>> > +	victim = sgx_oom_get_victim(lru);
>> > +	spin_unlock(&lru->lock);
>> > +
>> > +	if (!victim)
>> > +		return false;
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > +		return sgx_oom_encl_page(victim->encl_page);
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > +		return sgx_oom_encl(victim->encl);
>>
>> I hate to bring this up, at least at this stage, but I am wondering why  
>> we need
>> to put VA and SECS pages to the unreclaimable list, but cannot keep an
>> "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that  
> the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> EPC code
> didn't actually kill the VM process, it instead just freed all of the  
> EPC pages
> and abused the SGX architecture to effectively make the guest recreate  
> all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).
>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user process (e.g., QEMU) launching  
> the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> the
>   user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced  
> somehow.  Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> userspace running
>> > enclaves in a VM could continue on and refuse to give up its EPC, and  
> thus run above
> its limit in perpetuity.
>
The cgroup would refuse to allocate more once the limit is reached, so VMs
cannot run above the limit.

IIRC VMs only support a static EPC size right now; reaching the limit at
launch means the EPC size given on the QEMU command line is not appropriate,
so the VM should not launch, hence the current behavior.

[All EPC pages in the guest are allocated on page faults caused by the
sanitization process in the guest kernel during init, which is part of the VM
launch process. So SIGBUS turns into a failed VM launch.]
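
(For context, the current fault path is roughly the below, condensed from
arch/x86/kernel/cpu/sgx/virt.c with the FAULT_FLAG_ALLOW_RETRY handling
omitted; every guest EPC page is populated on demand here, so an allocation
failure during launch surfaces as SIGBUS:)

static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
{
	struct sgx_vepc *vepc = vmf->vma->vm_private_data;
	int ret;

	mutex_lock(&vepc->lock);
	/* Allocates and maps one EPC page into the guest's vEPC mapping. */
	ret = __sgx_vepc_fault(vepc, vmf->vma, vmf->address);
	mutex_unlock(&vepc->lock);

	if (!ret)
		return VM_FAULT_NOPAGE;

	/* Allocation failed (e.g. cgroup limit hit): SIGBUS to user space. */
	return VM_FAULT_SIGBUS;
}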

Once it is launched, the guest kernel has a 'total capacity' given by the
static value from the QEMU option. It starts paging when that is used up,
and never asks the host for more.

In a future with dynamic EPC for running guests, QEMU could handle the
allocation failure and pass SIGBUS to the running guest kernel.  Is that a
correct understanding?


> I can see userspace wanting to explicitly terminate the VM instead of  
> "silently"
> the VM's enclaves, but that seems like it should be a knob in the  
> virtual EPC
> code.

If my understanding above is correct and I am reading your statement
correctly, then I don't see that we really need a separate knob for the vEPC
code. A running guest reaching a cgroup limit (assuming dynamic allocation is
implemented) should not automatically translate into killing the VM.
Instead, it's user space's job to work with the guest to handle the
allocation failure. The guest could page out or kill enclaves.

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  1:38         ` Haitao Huang
@ 2023-10-10  2:12           ` Huang, Kai
  2023-10-10 17:05             ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  2:12 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, haitao.huang, tglx, jarkko, Mehta, Sohil
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie


> > > > 
> > > Later the hosting process could be migrated/reassigned to another cgroup?
> > > What to do when the new cgroup is OOM?
> > > 
> > 
> > You addressed this in the documentation, no?
> > 
> > +Migration
> > +---------
> > +
> > +Once an EPC page is charged to a cgroup (during allocation), it
> > +remains charged to the original cgroup until the page is released
> > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > +move the EPC charges that it incurred while in the previous cgroup
> > +to its new cgroup.
> 
> Should we kill the enclave though because some VA pages may be in the new  
> group?
> 

I guess that's acceptable?

And is there any difference if you keep VA/SECS on the unreclaimable list? If
you migrate one enclave to another cgroup, the old EPC pages stay in the old
cgroup while the new ones are charged to the new group, IIUC.

I am not cgroup expert, but by searching some old thread it appears this isn't a
supported model:

https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  1:42       ` Haitao Huang
@ 2023-10-10  2:23         ` Huang, Kai
  2023-10-10 13:26           ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  2:23 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, Li, Zhiquan1,
	dave.hansen, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
> <seanjc@google.com> wrote:
> 
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why  
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that  
> > the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> > EPC code
> > didn't actually kill the VM process, it instead just freed all of the  
> > EPC pages
> > and abused the SGX architecture to effectively make the guest recreate  
> > all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> > not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> > any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user process (e.g., QEMU) launching  
> > the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> > EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> > the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > somehow.  Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> > of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > userspace running
> > enclaves in a VM could continue on and refuse to give up its EPC, and  
> > thus run above
> > its limit in perpetuity.
> > 
> Cgroup would refuse to allocate more when limit is reached so VMs can not  
> run above limit.
> 
> IIRC VMs only support static EPC size right now, reaching limit at launch  
> means the EPC size given in command line for QEMU is not appropriate. So  
> VM should not launch, hence the current behavior.
> 
> [All EPC pages in the guest are allocated on page faults caused by the  
> sanitization process in the guest kernel during init, which is part of the  
> VM launch process. So SIGBUS turns into a failed VM launch.]
> 
> Once it is launched, guest kernel would have 'total capacity' given by the  
> static value from QEMU option. And it would start paging when it is used  
> up, never would ask for more from host.
> 
> For future with dynamic EPC for running guests, QEMU could handle  
> allocation failure and pass SIGBUS to the running guest kernel.  Is that  
> correct understanding?
> 
> 
> > I can see userspace wanting to explicitly terminate the VM instead of  
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the  
> > virtual EPC
> > code.
> 
> If my understanding above is correct and understanding your statement  
> above correctly, then don't see we really need separate knob for vEPC  
> code. Reaching a cgroup limit by a running guest (assuming dynamic  
> allocation implemented) should not translate automatically killing the VM.  
> Instead, it's user space job to work with guest to handle allocation  
> failure. Guest could page and kill enclaves.
> 

IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:

1) misc.max = 100M
2) Launch VMs with total virtual EPC size = 100M	<- success
3) misc.max = 50M

3) will also succeed, but nothing will happen; the VMs will still be holding
100M of EPC.

You need to somehow track virtual EPC and kill the VM instead.

(or somehow fail to do 3) if it is also an acceptable option.)


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (5 preceding siblings ...)
  (?)
@ 2023-10-10  9:19   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  9:19 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish


> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	if (cg)
> +		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +
> +	return NULL;
> +}
> +
> 

Is it a good idea to allow passing a NULL @cg to this basic function?

Why not only call this function when @cg is valid?
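
E.g., something like this, illustrative only:

static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
{
	/* Caller guarantees @cg is valid. */
	return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
}

struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
{
	struct misc_cg *cg = get_current_misc_cg();
	struct sgx_epc_cgroup *epc_cg;
	int ret;

	if (!cg)	/* hypothetical: treat "no cgroup" as "no limit" */
		return NULL;

	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
	put_misc_cg(cg);

	return ret ? ERR_PTR(ret) : epc_cg;
}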

> +
> +static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
> +				       bool reclaim)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	unsigned int nr_empty = 0;
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
> +				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
> +					return -ENOMEM;
> +				schedule();
> +			}
> +		}
> +	}
> +	if (epc_cg->cg != misc_cg_root())
> +		css_get(&epc_cg->cg->css);

I don't quite understand why root is treated specially.

And I thought get_current_misc_cg() in sgx_epc_cgroup_try_charge() already grabs
the reference before calling this function?  Why do it again?

> +
> +	return 0;
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + * @mm:			the mm_struct of the process to charge
> + * @reclaim:		whether or not synchronous reclaim is allowed
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
> +	put_misc_cg(epc_cg->cg);
> +
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;
> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (epc_cg->cg != misc_cg_root())
> +		put_misc_cg(epc_cg->cg);

Again why root is special?  And where is the get_misc_cg()?

Oh is it the 

	if (epc_cg->cg != misc_cg_root())
		css_get(&epc_cg->cg->css);

in __sgx_epc_cgroup_try_charge()?

That's horrible to follow.  Can this be explicitly done in
sgx_epc_cgroup_try_charge() and sgx_epc_cgroup_uncharge(), that is, grab the
reference in the former and release the reference in the latter?
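
I.e., something like this rough, untested sketch, so the lifetime rule is
visible in one place instead of being buried in __sgx_epc_cgroup_try_charge():

struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
{
	struct sgx_epc_cgroup *epc_cg;
	int ret;

	if (sgx_epc_cgroup_disabled())
		return NULL;

	/* get_current_misc_cg() already grabbed a reference for us. */
	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());

	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
	if (ret) {
		put_misc_cg(epc_cg->cg);	/* drop it only on failure */
		return ERR_PTR(ret);
	}

	return epc_cg;	/* the reference pins the cgroup until uncharge */
}

void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
{
	if (sgx_epc_cgroup_disabled())
		return;

	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
	put_misc_cg(epc_cg->cg);	/* pairs with the get in try_charge() */
}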


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (6 preceding siblings ...)
  (?)
@ 2023-10-10  9:32   ` Huang, Kai
  -1 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-10  9:32 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, haitao.huang, Mehta, Sohil, tj, mingo
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
> 
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
> 
> 

[...]

> 
> 7) Other minor refactoring:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
> ---
>  arch/x86/Kconfig                     |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
>  arch/x86/kernel/cpu/sgx/main.c       |  68 ++++-
>  arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

As mentioned before, this patch is pretty large and thus hard to review.  I
think we should try to split it into smaller patches so that they are
reviewable.

I cannot recall how many times I've scrolled up and down just to find some
function.

Any idea to further split this patch?

Also, I am thinking _perhaps_ the way the patches of this patchset are
organized can be improved.  My impression is that this patchset is organized
in this way:

1) There are many small patches adjusting the elemental code pieces to suit
EPC cgroup support, but many of them don't carry enough design information to
justify themselves, only saying "EPC cgroup will use this later" etc.

2) And then the EPC cgroup is implemented in one large patch at the end.

Both 1) and 2) are hard to review.  I need to do a lot of back and forth to
review this series.

I am not finger-pointing at anything, because it's not easy at all; I just
want to explore options that may make this series easier to review.

For instance, would the below make more sense:

Instead of implementing the EPC cgroup in one big patch, we introduce the key
structures and elemental helpers in separate patch(es) at an early position,
so that it's easy to review the basic logic/conversion.

And then we may move some key logic of handling EPC cgroup, e.g., reclaim logic,
to early patches when we adjust the elemental code pieces for EPC cgroup.

Perhaps it's worth a try, but that's just my 2 cents.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  2:23         ` Huang, Kai
@ 2023-10-10 13:26           ` Haitao Huang
  2023-10-11  0:01             ` Sean Christopherson
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-10 13:26 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, Li, Zhiquan1,
	dave.hansen, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
>> Hi Sean
>>
>> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
>> <seanjc@google.com> wrote:
>>
>> > On Mon, Oct 09, 2023, Kai Huang wrote:
>> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > > +/**
>> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
>> LRU
>> > > > + * @lru:	LRU that is low
>> > > > + *
>> > > > + * Return:	%true if a victim was found and kicked.
>> > > > + */
>> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > > +{
>> > > > +	struct sgx_epc_page *victim;
>> > > > +
>> > > > +	spin_lock(&lru->lock);
>> > > > +	victim = sgx_oom_get_victim(lru);
>> > > > +	spin_unlock(&lru->lock);
>> > > > +
>> > > > +	if (!victim)
>> > > > +		return false;
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > > +		return sgx_oom_encl(victim->encl);
>> > >
>> > > I hate to bring this up, at least at this stage, but I am wondering  
>> why
>> > > we need
>> > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
>> an
>> > > "enclave_list" instead?
>> >
>> > The motivation for tracking EPC pages instead of enclaves was so that
>> > the EPC
>> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The  
>> virtual
>> > EPC code
>> > didn't actually kill the VM process, it instead just freed all of the
>> > EPC pages
>> > and abused the SGX architecture to effectively make the guest recreate
>> > all its
>> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>> >
>> > Looks like y'all punted on that with:
>> >
>> >   The EPC pages allocated for KVM guests by the virtual EPC driver are
>> > not
>> >   reclaimable by the host kernel [5]. Therefore they are not tracked  
>> by
>> > any
>> >   LRU lists for reclaiming purposes in this implementation, but they  
>> are
>> >   charged toward the cgroup of the user process (e.g., QEMU)  
>> launching
>> > the
>> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual
>> > EPC
>> >   driver will stop allocating more EPC for the VM, and return SIGBUS  
>> to
>> > the
>> >   user process which would abort the VM launch.
>> >
>> > which IMO is a hack, unless returning SIGBUS is actually enforced
>> > somehow.  Relying
>> > on userspace to be kind enough to kill its VMs kinda defeats the  
>> purpose
>> > of cgroup
>> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,
>> > userspace running
>> > enclaves in a VM could continue on and refuse to give up its EPC, and  
>> > thus run above
>> > its limit in perpetuity.
>> >
>> Cgroup would refuse to allocate more when limit is reached so VMs can  
>> not
>> run above limit.
>>
>> IIRC VMs only support static EPC size right now, reaching limit at  
>> launch
>> means the EPC size given in command line for QEMU is not appropriate. So
>> VM should not launch, hence the current behavior.
>>
>> [All EPC pages in the guest are allocated on page faults caused by the
>> sanitization process in the guest kernel during init, which is part of the
>> VM launch process. So SIGBUS turns into a failed VM launch.]
>>
>> Once it is launched, guest kernel would have 'total capacity' given by  
>> the
>> static value from QEMU option. And it would start paging when it is used
>> up, never would ask for more from host.
>>
>> For future with dynamic EPC for running guests, QEMU could handle
>> allocation failure and pass SIGBUS to the running guest kernel.  Is that
>> correct understanding?
>>
>>
>> > I can see userspace wanting to explicitly terminate the VM instead of
>> > "silently"
>> > the VM's enclaves, but that seems like it should be a knob in the
>> > virtual EPC
>> > code.
>>
>> If my understanding above is correct and understanding your statement
>> above correctly, then don't see we really need separate knob for vEPC
>> code. Reaching a cgroup limit by a running guest (assuming dynamic
>> allocation implemented) should not translate automatically killing the  
>> VM.
>> Instead, it's user space job to work with guest to handle allocation
>> failure. Guest could page and kill enclaves.
>>
>
> IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
>
> 1) misc.max = 100M
> 2) Launch VMs with total virtual EPC size = 100M	<- success
> 3) misc.max = 50M
>
> > 3) will also succeed, but nothing will happen; the VMs will still be
> > holding 100M of EPC.
>
> You need to somehow track virtual EPC and kill VM instead.
>
> (or somehow fail to do 3) if it is also an acceptable option.)
>
Thanks for explaining it.

There is an error code to return from max_write. I can add that to the
callback definition too, and fail the write when it can't be enforced for any
reason. I would like some community feedback on whether this is acceptable,
though.
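
A hypothetical shape for that callback (none of these helpers exist today;
all names are made up for illustration):

static int sgx_epc_cgroup_max_write(struct misc_cg *cg, u64 new_max)
{
	struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_misc_cg(cg);

	if (sgx_epc_cgroup_usage(epc_cg) <= new_max)
		return 0;		/* nothing to enforce */

	/* Try to reclaim down to the new limit... */
	sgx_epc_cgroup_reclaim_until(epc_cg, new_max);

	/* ...and refuse the write if unreclaimable usage still exceeds it. */
	if (sgx_epc_cgroup_usage(epc_cg) > new_max)
		return -EBUSY;

	return 0;
}

(Sean's reply below points out the catch: the reclaim, and any enclave kills
it triggers, would have already happened by the time the write fails.)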

I think to solve it ultimately, we need to be able to adjust the 'capacity'
of VMs rather than just kill them, which is basically the same as dynamic
allocation support for VMs (being able to increase/decrease the EPC size
while a VM is running). For now, we only have static allocation, so the max
can't be enforced once the VM is launched.

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  1:34         ` Huang, Kai
@ 2023-10-10 16:49           ` Haitao Huang
  2023-10-11  0:51             ` Huang, Kai
  2023-10-11  1:14             ` Huang, Kai
  0 siblings, 2 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-10 16:49 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
>> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
>> > On Mon, Oct 09, 2023, Kai Huang wrote:
>> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > > +/**
>> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
>> LRU
>> > > > + * @lru:	LRU that is low
>> > > > + *
>> > > > + * Return:	%true if a victim was found and kicked.
>> > > > + */
>> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > > +{
>> > > > +	struct sgx_epc_page *victim;
>> > > > +
>> > > > +	spin_lock(&lru->lock);
>> > > > +	victim = sgx_oom_get_victim(lru);
>> > > > +	spin_unlock(&lru->lock);
>> > > > +
>> > > > +	if (!victim)
>> > > > +		return false;
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > > +		return sgx_oom_encl(victim->encl);
>> > >
>> > > I hate to bring this up, at least at this stage, but I am wondering  
>> why we need
>> > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
>> an
>> > > "enclave_list" instead?
>> >
>> > The motivation for tracking EPC pages instead of enclaves was so that  
>> the EPC
>> > OOM-killer could "kill" VMs as well as host-owned enclaves. >
>>
>> Ah this seems a fair argument. :-)
>>
>> > The virtual EPC code
>> > didn't actually kill the VM process, it instead just freed all of the  
>> EPC pages
>> > and abused the SGX architecture to effectively make the guest  
>> recreate all its
>> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>>
>> It returns SIGBUS.  SGX VM live migration also requires enough EPC  
>> being able to
>> be allocated on the destination machine to work AFAICT.
>>
>> >
>> > Looks like y'all punted on that with:
>> >
>> >   The EPC pages allocated for KVM guests by the virtual EPC driver  
>> are not
>> >   reclaimable by the host kernel [5]. Therefore they are not tracked  
>> by any
>> >   LRU lists for reclaiming purposes in this implementation, but they  
>> are
>> >   charged toward the cgroup of the user process (e.g., QEMU)  
>> launching the
>> >   guest.  And when the cgroup EPC usage reaches its limit, the  
>> virtual EPC
>> >   driver will stop allocating more EPC for the VM, and return SIGBUS  
>> to the
>> >   user process which would abort the VM launch.
>> >
>> > which IMO is a hack, unless returning SIGBUS is actually enforced  
>> somehow. >
>>
>> "enforced" do you mean?
>>
>> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
>> allocate
>> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
>> hva_to_pfn(),
>> which eventually results in KVM returning -EFAULT to userspace in  
>> vcpu_run().
>> And Qemu then kills the VM with some nonsense message:
>>
>>         error: kvm run failed Bad address
>>         <dump guest registers nonsense>
>>
>> > Relying
>> > on userspace to be kind enough to kill its VMs kinda defeats the  
>> purpose of cgroup
>> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
>> userspace running
>> > enclaves in a VM could continue on and refuse to give up its EPC, and  
>> thus run above
>> > its limit in perpetuity.
>>
>> >
>> > I can see userspace wanting to explicitly terminate the VM instead of  
>> "silently"
>> > the VM's enclaves, but that seems like it should be a knob in the  
>> virtual EPC
>> > code.
>
> I guess I slightly misunderstood your words.
>
> You mean we want to kill VM when the limit is set to be lower than  
> virtual EPC
> size.
>
> This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> EPC too?
>

That flag is set for enclaves; do you mean we set a similar flag in the vepc
struct?

> In the sgx_vepc_fault(), we check this flag at early time and return  
> SIGBUS if
> it is set.
>
> But this also requires keeping virtual EPC pages in some list, and  
> handles them
> in sgx_epc_oom() too.
>
> And for virtual EPC pages, I guess the "young" logic can be applied thus
> probably it's better to keep the actual virtual EPC pages to a  
> (separate?) list
> instead of keeping the virtual EPC instance.
>
> 	struct sgx_epc_lru {
> 		struct list_head reclaimable;
> 		struct sgx_encl *enclaves;
> 		struct list_head vepc_pages;
> 	}
>
> Or still tracking VA/SECS and virtual EPC pages in a single  
> unrecliamable list?
>

One LRU should be OK, as we only need the relative order in which they are
loaded?
If a VA page is in front of a vEPC page, we just kill the host-side enclave
first before disrupting VMs in the same group.
The group is not in a good situation anyway, so the kernel just picks
something reasonable to force-kill.

Also, after rereading the sentences "The virtual EPC code didn't actually
kill the VM process, it instead just freed all of the EPC pages and
abused the SGX architecture to effectively make the guest recreate all
its enclaves..."

Maybe by "kill" vm, Sean means EREMOVE the vepc pages in the unreclaimable  
LRU, which effectively make enclaves in guest receiving "EPC lost" error  
and those enclaves are forced to be reloaded (all reasonable user space  
impl should already handle that). Not sure about free *all* of EPC pages  
though. we should just EREMOVE enough to bring down the usage. And disable  
new allocation in sgx_vepc_fault as kai outlined above. It also means user  
space needs to inject/pass the SIGBUS to guest (I'm not really familiar  
with this part, or maybe it's already there?). @sean is that what you  
mean? Sorry I've been slow on understanding this.

If this is the case, some may still think it's too disruptive to the guest,
because the guest did not have a chance to page out less active enclave
pages. But it's the user's limit to set, so it is justifiable as long as we
document this behavior.
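
A very rough sketch of that idea, with the helper names hypothetical (only
sgx_vepc_free_page() resembles the existing virt.c helper):

static bool sgx_oom_vepc(struct sgx_epc_cgroup *epc_cg)
{
	struct sgx_epc_page *page;

	/* EREMOVE just enough guest EPC to get back under the limit. */
	while (sgx_epc_cgroup_usage(epc_cg) > sgx_epc_cgroup_max(epc_cg)) {
		page = sgx_epc_pop_vepc_page(epc_cg);	/* oldest loaded first */
		if (!page)
			return false;

		/*
		 * EREMOVE + uncharge + free; the guest enclave backed by
		 * this page sees "EPC lost" and must be reloaded.
		 */
		sgx_vepc_free_page(page);
	}

	/* Block further vEPC allocation so new faults get SIGBUS. */
	sgx_vepc_block_allocation(epc_cg);
	return true;
}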

Thanks to both of you for great insights.

Haitao



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10  2:12           ` Huang, Kai
@ 2023-10-10 17:05             ` Haitao Huang
  2023-10-11  0:31               ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-10 17:05 UTC (permalink / raw)
  To: tj, linux-sgx, dave.hansen, x86, cgroups, hpa, mingo,
	linux-kernel, bp, tglx, jarkko, Mehta, Sohil, Huang, Kai
  Cc: kristen, Zhang, Bo, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, anakrish, yangjie

On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>> > > >
>> > > Later the hosting process could be migrated/reassigned to another
>> > > cgroup?
>> > > What to do when the new cgroup is OOM?
>> > >
>> >
>> > You addressed this in the documentation, no?
>> >
>> > +Migration
>> > +---------
>> > +
>> > +Once an EPC page is charged to a cgroup (during allocation), it
>> > +remains charged to the original cgroup until the page is released
>> > +or reclaimed.  Migrating a process to a different cgroup doesn't
>> > +move the EPC charges that it incurred while in the previous cgroup
>> > +to its new cgroup.
>>
>> Should we kill the enclave though, because some VA pages may be in the
>> new group?
>>
>
> I guess acceptable?
>
> And is there any difference if you keep VA/SECS on the unreclaimable list?

Tracking VA/SECS allows every cgroup in which an enclave has an allocation
to identify the enclave by following the back pointer and kill it as needed.

> If you migrate one
> enclave to another cgroup, the old EPC pages stay in the old cgroup  
> while the
> new one is charged to the new group IIUC.
>
> I am not cgroup expert, but by searching some old thread it appears this  
> isn't a
> supported model:
>
> https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
>

IIUC it's a different problem here. If we don't track the allocated VAs in
the new group, then an enclave that spans the two groups can't be killed
by the new group. If so, some enclave could just hide in a small group
and never get killed while it keeps allocating in a different group?

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10 13:26           ` Haitao Huang
@ 2023-10-11  0:01             ` Sean Christopherson
  2023-10-11 15:02               ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Sean Christopherson @ 2023-10-11  0:01 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie, Zhiquan1 Li,
	dave.hansen, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Sohil Mehta, bp, x86, kristen

On Tue, Oct 10, 2023, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> > > Hi Sean
> > > 
> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
> > > <seanjc@google.com> wrote:
> > > > I can see userspace wanting to explicitly terminate the VM instead of
> > > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the
> > > > virtual EPC
> > > > code.
> > > 
> > > If my understanding above is correct and understanding your statement
> > > above correctly, then don't see we really need separate knob for vEPC
> > > code. Reaching a cgroup limit by a running guest (assuming dynamic
> > > allocation implemented) should not translate automatically killing
> > > the VM.
> > > Instead, it's user space job to work with guest to handle allocation
> > > failure. Guest could page and kill enclaves.
> > > 
> > 
> > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
> > 
> > 1) misc.max = 100M
> > 2) Launch VMs with total virtual EPC size = 100M	<- success
> > 3) misc.max = 50M
> > 
> > 3) will also succeed, but nothing will happen; the VMs will still be
> > holding 100M of EPC.
> > 
> > You need to somehow track virtual EPC and kill VM instead.
> > 
> > (or somehow fail to do 3) if it is also an acceptable option.)
> > 
> Thanks for explaining it.
> 
> There is an error code to return from max_write. I can add that too to the
> callback definition and fail it when it can't be enforced for any reason.
> Would like some community feedback if this is acceptable though.

That likely isn't acceptable.  E.g. create a cgroup with both a host enclave and
virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB, and the
host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The cgroup
code would reclaim all of the enclave's reclaimable EPC, and then kill the enclave
because it's still over the limit.  And then fail the max_write because the cgroup
is *still* over the limit.  So in addition to burning a lot of cycles, from
userspace's perspective its enclave was killed for no reason, as the new limit
wasn't actually set.

> I think to solve it ultimately, we need to be able to adjust 'capacity' of VMs
> not to just kill them, which is basically the same as dynamic allocation
> support for VMs (being able to increase/decrease epc size when it is
> running). For now, we only have static allocation so max can't be enforced
> once it is launched.

No, reclaiming virtual EPC is not a requirement.  VMM EPC oversubscription is
insanely complex, and I highly doubt any users actually want to oversubscribe VMs.

There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged
userspace may set limits on a container to ensure the container doesn't *accidentally*
consume more EPC than it was allotted, e.g. due to a configuration bug that created
a VM with more EPC than it was supposed to have.  

My comments on virtual EPC vs. cgroups is much more about having sane, well-defined
behavior, not about saying the kernel actually needs to support oversubscribing EPC
for KVM guests.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10 17:05             ` Haitao Huang
@ 2023-10-11  0:31               ` Huang, Kai
  2023-10-11 16:04                 ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-11  0:31 UTC (permalink / raw)
  To: Mehta, Sohil, linux-sgx, x86, dave.hansen, cgroups, hpa, mingo,
	tj, bp, haitao.huang, tglx, jarkko, linux-kernel
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > > > > > 
> > > > > Later the hosting process could be migrated/reassigned to another
> > > > > cgroup?
> > > > > What to do when the new cgroup is OOM?
> > > > > 
> > > > 
> > > > You addressed this in the documentation, no?
> > > > 
> > > > +Migration
> > > > +---------
> > > > +
> > > > +Once an EPC page is charged to a cgroup (during allocation), it
> > > > +remains charged to the original cgroup until the page is released
> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > > > +move the EPC charges that it incurred while in the previous cgroup
> > > > +to its new cgroup.
> > > 
> > > Should we kill the enclave though because some VA pages may be in the  
> > > new
> > > group?
> > > 
> > 
> > I guess acceptable?
> > 
> > And is there any difference if you keep VA/SECS on the unreclaimable list?
> 
> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,  
> to identify the enclave following the back pointer and kill it as needed.
> 
> > If you migrate one
> > enclave to another cgroup, the old EPC pages stay in the old cgroup  
> > while the
> > new one is charged to the new group IIUC.
> > 
> > I am not cgroup expert, but by searching some old thread it appears this  
> > isn't a
> > supported model:
> > 
> > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
> > 
> 
> IIUC it's a different problem here. If we don't track the allocated VAs in  
> the new group, then the enclave that spans the two groups can't be killed  
> by the new group. If so, some enclave could just hide in some small group  
> and never gets killed but keeps allocating in a different group?
> 

I mean, from the link above, IIUC migrating an enclave among different
cgroups simply isn't a supported model, thus any bad behaviour isn't a big
concern in terms of decision making.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10 16:49           ` Haitao Huang
@ 2023-10-11  0:51             ` Huang, Kai
  2023-10-12 13:27               ` Haitao Huang
  2023-10-11  1:14             ` Huang, Kai
  1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-11  0:51 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > > > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > > +/**
> > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
> > > LRU
> > > > > > + * @lru:	LRU that is low
> > > > > > + *
> > > > > > + * Return:	%true if a victim was found and kicked.
> > > > > > + */
> > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > > > +{
> > > > > > +	struct sgx_epc_page *victim;
> > > > > > +
> > > > > > +	spin_lock(&lru->lock);
> > > > > > +	victim = sgx_oom_get_victim(lru);
> > > > > > +	spin_unlock(&lru->lock);
> > > > > > +
> > > > > > +	if (!victim)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > > > +		return sgx_oom_encl(victim->encl);
> > > > > 
> > > > > I hate to bring this up, at least at this stage, but I am wondering  
> > > why we need
> > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
> > > an
> > > > > "enclave_list" instead?
> > > > 
> > > > The motivation for tracking EPC pages instead of enclaves was so that  
> > > the EPC
> > > > OOM-killer could "kill" VMs as well as host-owned enclaves. >
> > > 
> > > Ah this seems a fair argument. :-)
> > > 
> > > > The virtual EPC code
> > > > didn't actually kill the VM process, it instead just freed all of the  
> > > EPC pages
> > > > and abused the SGX architecture to effectively make the guest  
> > > recreate all its
> > > > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > > 
> > > It returns SIGBUS.  SGX VM live migration also requires enough EPC  
> > > being able to
> > > be allocated on the destination machine to work AFAICT.
> > > 
> > > > 
> > > > Looks like y'all punted on that with:
> > > > 
> > > >   The EPC pages allocated for KVM guests by the virtual EPC driver  
> > > are not
> > > >   reclaimable by the host kernel [5]. Therefore they are not tracked  
> > > by any
> > > >   LRU lists for reclaiming purposes in this implementation, but they  
> > > are
> > > >   charged toward the cgroup of the user processs (e.g., QEMU)  
> > > launching the
> > > >   guest.  And when the cgroup EPC usage reaches its limit, the  
> > > virtual EPC
> > > >   driver will stop allocating more EPC for the VM, and return SIGBUS  
> > > to the
> > > >   user process which would abort the VM launch.
> > > > 
> > > > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > > somehow. >
> > > 
> > > "enforced" do you mean?
> > > 
> > > Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
> > > allocate
> > > EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
> > > hva_to_pfn(),
> > > which eventually results in KVM returning -EFAULT to userspace in  
> > > vcpu_run().
> > > And Qemu then kills the VM with some nonsense message:
> > > 
> > >         error: kvm run failed Bad address
> > >         <dump guest registers nonsense>
> > > 
> > > > Relying
> > > > on userspace to be kind enough to kill its VMs kinda defeats the  
> > > purpose of cgroup
> > > > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > > userspace running
> > > > encalves in a VM could continue on and refuse to give up its EPC, and  
> > > thus run above
> > > > its limit in perpetuity.
> > > 
> > > > 
> > > > I can see userspace wanting to explicitly terminate the VM instead of  
> > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the  
> > > virtual EPC
> > > > code.
> > 
> > I guess I slightly misunderstood your words.
> > 
> > You mean we want to kill VM when the limit is set to be lower than  
> > virtual EPC
> > size.
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?
> 
> > In the sgx_vepc_fault(), we check this flag at early time and return  
> > SIGBUS if
> > it is set.
> > 
> > But this also requires keeping virtual EPC pages in some list, and  
> > handles them
> > in sgx_epc_oom() too.
> > 
> > And for virtual EPC pages, I guess the "young" logic can be applied thus
> > probably it's better to keep the actual virtual EPC pages to a  
> > (separate?) list
> > instead of keeping the virtual EPC instance.
> > 
> > 	struct sgx_epc_lru {
> > 		struct list_head reclaimable;
> > 		struct sgx_encl *enclaves;
> > 		struct list_head vepc_pages;
> > 	}
> > 
> > Or still tracking VA/SECS and virtual EPC pages in a single  
> > unrecliamable list?
> > 
> 
> One LRU should be OK as we only need relative order in which they are  
> loaded?

It's one way to do it, but not the only way.

The disadvantage of using one unreclaimable list is that, for VA/SECS pages,
you don't have the concept of "young/age".  The only purpose of getting the
page is to get the owner enclave and kill it.

On the other hand, for virtual EPC pages we do have the concept of
"young/age", although I think it's acceptable not to implement this in the
first version of EPC cgroup support.

From this point of view, perhaps it's better to maintain VA/SECS (or
enclaves) separately from virtual EPC pages.  But it is not mandatory.

Another pro of maintaining them separately is that you don't need these
flags to distinguish them within a single list:
SGX_EPC_OWNER_{ENCL|PAGE|VEPC}, which I dislike.

(btw, even if you track VA/SECS pages in the unreclaimable list, given they
both have 'enclave' as the owner, do you still need both SGX_EPC_OWNER_ENCL
and SGX_EPC_OWNER_PAGE?)

That being said, personally I'd slightly prefer to keep them in separate
lists, but I can see it also depends on how you want to implement the
algorithm for selecting a victim.  So I am not going to insist for now, but
will leave it to you to decide.

Just remember to give the justification for whichever you choose.
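
For reference, a rough sketch of what the separate-lists option could look
like (the struct name is from this series; the exact field split is
illustrative only):

	struct sgx_epc_lru_lists {
		spinlock_t lock;
		/* enclave pages the reclaimer can swap out */
		struct list_head reclaimable;
		/* VA/SECS pages; owner is always an enclave */
		struct list_head unreclaimable;
		/* virtual EPC pages; owner is a vepc instance */
		struct list_head vepc_pages;
	};

With separate lists, the owner type is implied by which list a page sits
on, so the SGX_EPC_OWNER_* flags would not be needed to tell the owners
apart.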

[...]

> It also means user  
> space needs to inject/pass the SIGBUS to guest (I'm not really familiar  
> with this part, or maybe it's already there?). @sean is that what you  
> mean? Sorry I've been slow on understanding this.

I don't think we need to do this for now.  Killing the VM is an acceptable
start to me.  Just make sure we can actually kill a VM.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-10 16:49           ` Haitao Huang
  2023-10-11  0:51             ` Huang, Kai
@ 2023-10-11  1:14             ` Huang, Kai
  2023-10-16 11:02               ` Huang, Kai
  1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-11  1:14 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?

Yes.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-11  0:01             ` Sean Christopherson
@ 2023-10-11 15:02               ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-11 15:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie, Zhiquan1 Li,
	dave.hansen, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Sohil Mehta, bp, x86, kristen

On Tue, 10 Oct 2023 19:01:25 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Tue, Oct 10, 2023, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
>> > > Hi Sean
>> > >
>> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
>> > > <seanjc@google.com> wrote:
>> > > > I can see userspace wanting to explicitly terminate the VM  
>> instead of
>> > > > "silently"
>> > > > the VM's enclaves, but that seems like it should be a knob in the
>> > > > virtual EPC
>> > > > code.
>> > >
>> > > If my understanding above is correct and understanding your  
>> statement
>> > > above correctly, then don't see we really need separate knob for  
>> vEPC
>> > > code. Reaching a cgroup limit by a running guest (assuming dynamic
>> > > allocation implemented) should not translate automatically killing
>> > > the VM.
>> > > Instead, it's user space job to work with guest to handle allocation
>> > > failure. Guest could page and kill enclaves.
>> > >
>> >
>> > IIUC Sean was talking about changing misc.max _after_ you launch SGX  
>> VMs:
>> >
>> > 1) misc.max = 100M
>> > 2) Launch VMs with total virtual EPC size = 100M	<- success
>> > 3) misc.max = 50M
>> >
>> > 3) will also succeed, but nothing will happen, the VMs will be still
>> > holding 100M EPC.
>> >
>> > You need to somehow track virtual EPC and kill VM instead.
>> >
>> > (or somehow fail to do 3) if it is also an acceptable option.)
>> >
>> Thanks for explaining it.
>>
>> There is an error code to return from max_write. I can add that too to  
>> the
>> callback definition and fail it when it can't be enforced for any  
>> reason.
>> Would like some community feedback if this is acceptable though.
>
> That likely isn't acceptable.  E.g. create a cgroup with both a host  
> enclave and
> virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB,  
> and the
> host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The  
> cgroup
> code would reclaim all of the enclave's reclaimable EPC, and then kill  
> the enclave
> because it's still over the limit.  And then fail the max_write because  
> the cgroup
> is *still* over the limit.  So in addition to burning a lot of cycles,  
> from
> userspace's perspective its enclave was killed for no reason, as the new  
> limit
> wasn't actually set.

I was thinking that, before reclaiming enclave pages, if we know the
untracked vepc usage (current minus tracked) is larger than the limit, we
just return an error without enforcing the limit. That way the user also
knows something is wrong (rough sketch below).
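
Something like this in the max_write path (a sketch only; the helper name
is hypothetical, not from this series):

	/*
	 * If the usage that can never be reclaimed (vEPC carve-outs)
	 * already exceeds the new limit, fail the write up front rather
	 * than reclaiming/killing and then failing anyway.
	 */
	if (sgx_epc_cgroup_unreclaimable_usage(epc_cg) > new_max)
		return -EBUSY;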

But I get that we want to be able to kill VMs to enforce the new, lower
limit. I assume we can't optimize the efficiency/priority of killing:
enclave pages will be reclaimed first no matter what, just because they are
reclaimable; and there are no good rules for choosing a victim between
enclaves and VMs in your example, so enclaves could still be killed before
VMs.

>> I think to solve it ultimately, we need be able to adjust 'capacity' of  
>> VMs
>> not to just kill them, which is basically the same as dynamic allocation
>> support for VMs (being able to increase/decrease epc size when it is
>> running). For now, we only have static allocation so max can't be  
>> enforced
>> once it is launched.
>
> No, reclaiming virtual EPC is not a requirement.  VMM EPC  
> oversubscription is
> insanely complex, and I highly doubt any users actually want to  
> oversubcribe VMs.
>
> There are use cases for cgroups beyond oversubscribing/swapping, e.g.  
> privileged
> userspace may set limits on a container to ensure the container doesn't  
> *accidentally*
> consume more EPC than it was allotted, e.g. due to a configuration bug  
> that created
> a VM with more EPC than it was supposed to have.
>
> My comments on virtual EPC vs. cgroups is much more about having sane,  
> well-defined
> behavior, not about saying the kernel actually needs to support  
> oversubscribing EPC
> for KVM guests.

So here are the steps I see now; let me know if this is aligned with your
thinking or if I'm off track.

0) When all that is left in a cgroup are enclave VA/SECS pages and vEPC,
the OOM kill process starts.
1) The cgroup identifies a victim vepc for OOM kill (this could be before
or after enclaves are killed) and sets a new flag, VEPC_NO_MEMORY, in the
vepc.
2) Call sgx_vepc_remove_all(), ignoring the failure counts returned. This
makes a best effort to EREMOVE all normal pages used by the vepc.
3) The guest may trigger faults again; return SIGBUS in sgx_vepc_fault()
when VEPC_NO_MEMORY is set, and do the same for mmap (see the sketch after
this list).
4) That would eventually cause sgx_vepc_release(), which does the final
cleanup, before the VM dies or is killed by user space.
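
A minimal sketch of steps 1-3 (sgx_vepc_remove_all() and sgx_vepc_fault()
exist in arch/x86/kernel/cpu/sgx/virt.c; the VEPC_NO_MEMORY flag and the
OOM hook are hypothetical):

	/* Steps 1-2: the cgroup OOM path picks a victim vepc and strips it. */
	static void sgx_vepc_oom(struct sgx_vepc *vepc)
	{
		vepc->flags |= VEPC_NO_MEMORY;	/* hypothetical flag */
		sgx_vepc_remove_all(vepc);	/* best effort, ignore failures */
	}

	/* Step 3: later faults see the flag and get SIGBUS. */
	static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
	{
		struct sgx_vepc *vepc = vmf->vma->vm_private_data;

		if (vepc->flags & VEPC_NO_MEMORY)
			return VM_FAULT_SIGBUS;

		/* ... existing fault handling ... */
	}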

Q: Should we reset the VEPC_NO_MEMORY flag if #4 never happens and cgroup
usage drops below the limit? I suppose we could, but I'm not sure it is
sane, because the VM would try to load as many pages as originally
configured.

I'm still thinking about using one unreclaimable list; the justification
is simplicity and the lack of rules for selecting items from separate
lists, but I may change my mind if I find it inconvenient.

I'm not sure how age/youngness can be accounted for guest pages, though
Kai indicated we don't need that for the first version. So I assume we can
deal with it later and add a separate list for vEPC if needed for that
reason.

Thanks a lot for your input.
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-11  0:31               ` Huang, Kai
@ 2023-10-11 16:04                 ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-11 16:04 UTC (permalink / raw)
  To: Mehta, Sohil, linux-sgx, x86, dave.hansen, cgroups, hpa, mingo,
	tj, bp, tglx, jarkko, linux-kernel, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Tue, 10 Oct 2023 19:31:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> >
>> > > > > >
>> > > > > Later the hosting process could migrated/reassigned to another
>> > > cgroup?
>> > > > > What to do when the new cgroup is OOM?
>> > > > >
>> > > >
>> > > > You addressed in the documentation, no?
>> > > >
>> > > > +Migration
>> > > > +---------
>> > > > +
>> > > > +Once an EPC page is charged to a cgroup (during allocation), it
>> > > > +remains charged to the original cgroup until the page is released
>> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
>> > > > +move the EPC charges that it incurred while in the previous  
>> cgroup
>> > > > +to its new cgroup.
>> > >
>> > > Should we kill the enclave though because some VA pages may be in  
>> the
>> > > new
>> > > group?
>> > >
>> >
>> > I guess acceptable?
>> >
>> > And any difference if you keep VA/SECS to unreclaimabe list?
>>
>> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,
>> to identify the enclave following the back pointer and kill it as  
>> needed.
>>
>> > If you migrate one
>> > enclave to another cgroup, the old EPC pages stay in the old cgroup
>> > while the
>> > new one is charged to the new group IIUC.
>> >
>> > I am not cgroup expert, but by searching some old thread it appears  
>> this
>> > isn't a
>> > supported model:
>> >
>> > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
>> >
>>
>> IIUC it's a different problem here. If we don't track the allocated VAs  
>> in
>> the new group, then the enclave that spans the two groups can't be  
>> killed
>> by the new group. If so, some enclave could just hide in some small  
>> group
>> and never gets killed but keeps allocating in a different group?
>>
>
> I mean from the link above IIUC migrating enclave among different  
> cgroups simply
> isn't a supported model, thus any bad behaviour isn't a big concern in  
> terms of
> decision making.

If we leave some pages in a cgroup unkillable, we are in the same
situation of not being able to enforce a cgroup limit as we are if we
don't kill VMs for lowered limits.

I think not supporting migration of pages between cgroups should not
leave an enforcement gap, just like we don't want an enforcement gap if we
let VMs hold on to their pages once they are launched.

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-11  0:51             ` Huang, Kai
@ 2023-10-12 13:27               ` Haitao Huang
  2023-10-16 10:57                 ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-12 13:27 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote:
[...]
> (btw, even you track VA/SECS pages in unreclaimable list, given they  
> both have
> 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> SGX_EPC_OWNER_PAGE ?)

Let me think about it; there might also be a way to just track encl
objects rather than unreclaimable pages.

I still don't get why we need to kill the VM rather than just removing
enough pages. Is it because, with static allocation, the pages cannot be
reclaimed?


If we always remove all vEPC pages / kill the VM, then we should not need
to track individual vepc pages.

Thanks

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-12 13:27               ` Haitao Huang
@ 2023-10-16 10:57                 ` Huang, Kai
  2023-10-16 19:52                   ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-16 10:57 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> [...]
> > (btw, even you track VA/SECS pages in unreclaimable list, given they  
> > both have
> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> > SGX_EPC_OWNER_PAGE ?)
> 
> Let me think about it, there might be also a way just track encl objects  
> not unreclaimable pages.
> 
> I still not get why we need kill the VM not just remove just enough pages.  
> Is it due to the static allocation not able to reclaim?

We can choose to "just remove enough EPC pages".  The VM may or may not be
killed when it wants the EPC pages back, depending on whether the current EPC
cgroup can provide enough EPC pages or not.  And this depends on how we
implement the cgroup algorithm to reclaim EPC pages.

One problem could be: for an EPC cgroup that only contains SGX VMs, you may
end up moving EPC pages from one VM to another and back endlessly, because
you never actually mark any VM as dead the way OOM does for normal
enclaves.

From this point of view, you still need a way to kill a VM, IIUC.

I think the key point of virtual EPC vs cgroup, as quoted from Sean, should be
"having sane, well-defined behavior".

Does "just remove enough EPC pages" meet this?  If the problem mentioned above
can be avoided, I suppose so?  So if there's an easy way to achieve, I guess it
can be an option too.

But for the initial support, IMO we are not looking for a perfect but
complicated solution.  I would say, from a review point of view, it's
preferable to have a simple implementation that achieves a not-perfect,
but consistent, well-defined behaviour.

So to me it looks like killing the VM when the cgroup cannot reclaim any
more EPC pages is a simple option.
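
In pseudo-C, the simple option amounts to something like this
(sgx_epc_oom() is from this patch; the usage/reclaim helpers and the loop
itself are illustrative only):

	/*
	 * Shrink the cgroup toward its limit; once per-cgroup reclaim
	 * stops making progress, kill an owner (enclave or VM) via the
	 * OOM path instead of spinning forever.
	 */
	while (sgx_epc_cgroup_usage(epc_cg) > max) {
		if (sgx_epc_cgroup_reclaim_pages(epc_cg))
			continue;		/* made progress, retry */
		if (!sgx_epc_oom(&epc_cg->lru))
			break;			/* nothing left to kill */
	}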

But I might have missed something, especially since I have been fighting
a fever and headache since the middle of last week :-)

So as mentioned above, you can try other alternatives, but please avoid
complicated ones.

Also, I guess it will be helpful if we can understand the typical SGX app
and/or SGX VM deployment in the EPC cgroup use case.  This may help us
justify why the EPC cgroup's algorithm for selecting a victim is
reasonable.



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-11  1:14             ` Huang, Kai
@ 2023-10-16 11:02               ` Huang, Kai
  0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-10-16 11:02 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Wed, 2023-10-11 at 01:14 +0000, Huang, Kai wrote:
> On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > > 
> > > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > > EPC too?
> > > 
> > 
> > That flag is set for enclaves, do you mean we set similar flag in vepc  
> > struct?
> 
> Yes.

I missed the "ENCL" part but only noted the "NO_MEMORY" part, so I guess it
should not be used directly for vEPC.  So if it is needed, SGX_VEPC_NO_MEMORY,
or a simple 'bool dead' or similar in 'struct sgx_vepc' is more suitable.
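
I.e., something like this (the two existing fields are what 'struct
sgx_vepc' has in arch/x86/kernel/cpu/sgx/virt.c today; the new field is
hypothetical):

	struct sgx_vepc {
		struct xarray page_array;
		struct mutex lock;
		/*
		 * Hypothetical: set by the cgroup OOM path; fault/mmap
		 * return SIGBUS from then on.
		 */
		bool dead;
	};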

As I said, I was fighting a fever and headache last week :-)

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 10:57                 ` Huang, Kai
@ 2023-10-16 19:52                   ` Haitao Huang
  2023-10-16 21:09                     ` Huang, Kai
  2023-10-16 21:32                     ` Sean Christopherson
  0 siblings, 2 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-16 19:52 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 16 Oct 2023 05:57:36 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
>> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>> [...]
>> > (btw, even you track VA/SECS pages in unreclaimable list, given they
>> > both have
>> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
>> > SGX_EPC_OWNER_PAGE ?)
>>
>> Let me think about it, there might be also a way just track encl objects
>> not unreclaimable pages.
>>
>> I still not get why we need kill the VM not just remove just enough  
>> pages.
>> Is it due to the static allocation not able to reclaim?
>
> We can choose to "just remove enough EPC pages".  The VM may or may not  
> be
> killed when it wants the EPC pages back, depending on whether the  
> current EPC
> cgroup can provide enough EPC pages or not.  And this depends on how we
> implement the cgroup algorithm to reclaim EPC pages.
>
> One problem could be: for a EPC cgroup only has SGX VMs, you may end up  
> with
> moving EPC pages from one VM to another and then vice versa endlessly,

This could be a valid use case, though, if people intend to share EPC
between two VMs. I understand no one would be able to use VMs this way
currently with static allocation.

> because
> you never really actually mark any VM to be dead just like OOM does to  
> the
> normal enclaves.
>
> From this point, you still need a way to kill a VM, IIUC.
>
> I think the key point of virtual EPC vs cgroup, as quoted from Sean,  
> should be
> "having sane, well-defined behavior".
>
> Does "just remove enough EPC pages" meet this?  If the problem mentioned  
> above
> can be avoided, I suppose so?  So if there's an easy way to achieve, I  
> guess it
> can be an option too.
>
> But for the initial support, IMO we are not looking for a perfect but yet
> complicated solution.  I would say, from review's point of view, it's  
> preferred
> to have a simple implementation to achieve a not-prefect, but  
> consistent, well-
> defined behaviour.
>
> So to me looks killing the VM when cgroup cannot reclaim any more EPC  
> pages is a
> simple option.
>
> But I might have missed something, especially since middle of last week  
> I have
> been having fever and headache :-)
>
> So as mentioned above, you can try other alternatives, but please avoid
> complicated ones.
>
> Also, I guess it will be helpful if we can understand the typical SGX  
> app and/or
> SGX VM deployment under EPC cgroup use case.  This may help us on  
> justifying why
> the EPC cgroup algorithm to select victim is reasonable.
>


From this perspective, I think the current implementation is
"well-defined": EPC cgroup limits for VMs are only enforced at VM launch
time, not at runtime. In practice, an SGX VM can only be launched with a
fixed EPC size, and all of that EPC is fully committed to the VM once
launched. Because of that, I imagine people are primarily using VMs to
partition the physical EPC, i.e., the static size itself is the 'limit'
for the workload of a single VM, with no expectation of EPC being taken
away at runtime.

So killing does not really add much value for the existing usages, IIUC.

That said, I don't anticipate that adding runtime enforcement by killing
VMs would break such usages, as the admin/user can simply set the limit
equal to the static size, launch the VM, and forget about it.

Given that, I'll propose an add-on patch to this series as an RFC and
gather some feedback from the community before we decide whether it needs
to be included in the first version, or can be skipped until we have EPC
reclaiming for VMs.

Thanks
Haitao
 

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 19:52                   ` Haitao Huang
@ 2023-10-16 21:09                     ` Huang, Kai
  2023-10-17  0:10                       ` Haitao Huang
  2023-10-16 21:32                     ` Sean Christopherson
  1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-16 21:09 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

> 
> 
>  From this perspective, I think the current implementation is  
> "well-defined": EPC cgroup limits for VMs are only enforced at VM launch  
> time, not runtime. In practice,  SGX VM can be launched only with fixed  
> EPC size and all those EPCs are fully committed to the VM once launched.  
> Because of that, I imagine people are using VMs to primarily partition the  
> physical EPCs, i.e, the static size itself is the 'limit' for the workload  
> of a single VM and not expecting EPCs taken away at runtime.
> 
> So killing does not really add much value for the existing usages IIUC.

It's not about adding value to the existing usages; it's about fixing the
issue when we lower the EPC limit to a point that is less than the total
virtual EPC size.

It's a design issue, or simply a bug in the current implementation we need to
fix.

> 
> That said, I don't anticipate adding the enforcement of killing VMs at  
> runtime would break such usages as admin/user can simply choose to set the  
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have  
> some feedback from community before we decide if that needs be included in  
> first version or we can skip it until we have EPC reclaiming for VMs.

I don't understand what "add-on" patch you are talking about.

I mentioned the "typical deployment thing" is that can help us understand which
algorithm is better way to select the victim.  But no matter what we choose, we
still need to fix the bug mentioned above here.

I really think you should just go this simple way: 

When you want to take EPC back from VM, kill the VM.

Another bad thing about "just removing EPC pages from the VM" is that the
enclaves in the VM would suffer a "sudden loss of EPC", or even worse,
suffer it at a high frequency.  Although we depend on that for supporting
SGX VM live migration, it needs to be avoided if possible.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 19:52                   ` Haitao Huang
  2023-10-16 21:09                     ` Huang, Kai
@ 2023-10-16 21:32                     ` Sean Christopherson
  2023-10-17  0:09                       ` Haitao Huang
  2023-10-17 11:49                       ` Mikko Ylinen
  1 sibling, 2 replies; 144+ messages in thread
From: Sean Christopherson @ 2023-10-16 21:32 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie, dave.hansen,
	Zhiquan1 Li, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Sohil Mehta, bp, x86, kristen

On Mon, Oct 16, 2023, Haitao Huang wrote:
> From this perspective, I think the current implementation is "well-defined":
> EPC cgroup limits for VMs are only enforced at VM launch time, not runtime.
> In practice,  SGX VM can be launched only with fixed EPC size and all those
> EPCs are fully committed to the VM once launched.

Fully committed doesn't mean those numbers are reflected in the cgroup.  A VM
scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e.
when the guest attempts to actually access a page.  Preallocating memory isn't
free, e.g. it can slow down guest boot, so it's entirely reasonable to have virtual
EPC be allocated on-demand.  Enforcing at launch time doesn't work for such setups,
because from the cgroup's perspective, the VM is using 0 pages of EPC at launch.

> Because of that, I imagine people are using VMs to primarily partition the
> physical EPCs, i.e, the static size itself is the 'limit' for the workload of
> a single VM and not expecting EPCs taken away at runtime.

If everything goes exactly as planned, sure.  But it's not hard to imagine some
configuration change way up the stack resulting in the hard limit for an EPC cgroup
being lowered.

> So killing does not really add much value for the existing usages IIUC.

As I said earlier, the behavior doesn't have to result in terminating a VM, e.g.
the virtual EPC code could provide a knob to send a signal/notification if the
owning cgroup has gone above the limit and the VM is targeted for forced reclaim.

> That said, I don't anticipate adding the enforcement of killing VMs at
> runtime would break such usages as admin/user can simply choose to set the
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have some
> feedback from community before we decide if that needs be included in first
> version or we can skip it until we have EPC reclaiming for VMs.

Gracefully *swapping* virtual EPC isn't required for oversubscribing virtual EPC.
Think of it like airlines overselling tickets.  The airline sells more tickets
than they have seats, and banks on some passengers canceling.  If too many people
show up, the airline doesn't swap passengers to the cargo bay, they just shunt them
to a different plane.

The same could be easily be done for hosts and virtual EPC.  E.g. if every VM
*might* use 1GiB, but in practice 99% of VMs only consume 128MiB, then it's not
too crazy to advertise 1GiB to each VM, but only actually carve out 256MiB per VM
in order to pack more VMs on a host.  If the host needs to free up EPC, then the
most problematic VMs can be migrated to a different host.

Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs?
AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
the primary use case for an EPC cgroup.

I don't have any skin in the game beyond my name being attached to some of the
patches, i.e. I certainly won't stand in the way.  I just don't understand why
you would go through all the effort of adding an EPC cgroup and then not go the
extra few steps to enforce limits for virtual EPC.  Compared to the complexity
of the rest of the series, that little bit seems quite trivial.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 21:32                     ` Sean Christopherson
@ 2023-10-17  0:09                       ` Haitao Huang
  2023-10-17 15:43                         ` Sean Christopherson
  2023-10-17 11:49                       ` Mikko Ylinen
  1 sibling, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-17  0:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie, dave.hansen,
	Zhiquan1 Li, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Sohil Mehta, bp, x86, kristen

Hi Sean

On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Mon, Oct 16, 2023, Haitao Huang wrote:
>> From this perspective, I think the current implementation is  
>> "well-defined":
>> EPC cgroup limits for VMs are only enforced at VM launch time, not  
>> runtime.
>> In practice,  SGX VM can be launched only with fixed EPC size and all  
>> those
>> EPCs are fully committed to the VM once launched.
>
> Fully committed doesn't mean those numbers are reflected in the cgroup.   
> A VM
> scheduler can easily "commit" EPC to a guest, but allocate EPC on  
> demand, i.e.
> when the guest attempts to actually access a page.  Preallocating memory  
> isn't
> free, e.g. it can slow down guest boot, so it's entirely reasonable to  
> have virtual
> EPC be allocated on-demand.  Enforcing at launch time doesn't work for  
> such setups,
> because from the cgroup's perspective, the VM is using 0 pages of EPC at  
> launch.
>
Maybe I understood the current implementation wrong. From what I see, it
is impossible for vEPC not to be fully committed at launch time. The guest
would EREMOVE all pages during initialization, resulting in #PFs and all
pages being allocated. This essentially makes "prealloc=off" the same as
"prealloc=on". Unless you are talking about some custom OS or kernel other
than upstream Linux here?

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 21:09                     ` Huang, Kai
@ 2023-10-17  0:10                       ` Haitao Huang
  2023-10-17  1:34                         ` Huang, Kai
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-17  0:10 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote:
[...]

> still need to fix the bug mentioned above here.
>
> I really think you should just go this simple way:
>
> When you want to take EPC back from VM, kill the VM.
>

My only concern is that this is a compromise due to a current limitation
(no other sane way to take EPC from VMs). If we define this behavior and
it becomes a contract with user space, then we can't change it in the
future.

On the other hand, my understanding is that the reason you want this
behavior is to enforce the EPC limit at runtime. I'm just not sure how
important that is, or whether it is a real usage, given all the
limitations SGX VMs currently have (static EPC size, no migration).

Thanks

Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17  0:10                       ` Haitao Huang
@ 2023-10-17  1:34                         ` Huang, Kai
  2023-10-17 12:58                           ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-10-17  1:34 UTC (permalink / raw)
  To: Christopherson,, Sean, haitao.huang
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> [...]
> 
> > still need to fix the bug mentioned above here.
> > 
> > I really think you should just go this simple way:
> > 
> > When you want to take EPC back from VM, kill the VM.
> > 
> 
> My only concern is that this is a compromise due to current limitation (no  
> other sane way to take EPC from VMs). If we define this behavior and it  
> becomes a contract to user space, then we can't change in future.

Why do we need to "define such behaviour"?

This isn't some kind of kernel/userspace ABI IMHO, but only
kernel-internal implementation.  Here a VM is similar to normal host
enclaves.  You limit the resource, and some host enclaves could be killed.
Similarly, a VM could be killed too.

And supporting VMM EPC oversubscription doesn't mean a VM won't be
killed.  The VM can still be a kill target after all of its EPC pages have
been swapped out.

> 
> On the other hand, my understanding the reason you want this behavior is  
> to enforce EPC limit at runtime. 
> 

No, I just thought this is a bug/issue that needs to be fixed.  If anyone
believes it is not a bug/issue, then that's a separate discussion.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-16 21:32                     ` Sean Christopherson
  2023-10-17  0:09                       ` Haitao Huang
@ 2023-10-17 11:49                       ` Mikko Ylinen
  1 sibling, 0 replies; 144+ messages in thread
From: Mikko Ylinen @ 2023-10-17 11:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Haitao Huang, Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie,
	dave.hansen, Zhiquan1 Li, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, Sohil Mehta, bp, x86, kristen

On Mon, Oct 16, 2023 at 02:32:31PM -0700, Sean Christopherson wrote:
> Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs?

People who work with containers: [1], [2]. 

> AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
> the primary use case for an EPC cgroup.

The common setup is that a cloud VM instance with vEPC is created, and then
several SGX enclave containers are run simultaneously on that instance. The
EPC cgroup is used to ensure that each container gets its own share of EPC
(and any attempt to go beyond the limit is reclaimed and charged to
the container's memcg). The same containers-with-enclaves use case is
applicable to bare metal as well, though.

As far as Kubernetes-orchestrated containers are concerned, "in-place" resource
scaling is still in very early stages, which means that the cgroup values are
adjusted by *re-creating* the container. The hierarchies are also built
such that there's no mix of VMs w/ vEPC and enclaves in the same tree.

Mikko

[1] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m6d1c895534b4c0636f47c2d1620016b4c362bb9b
[2] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m37600e457b832feee6e8346aa74dcff8f21965f8

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17  1:34                         ` Huang, Kai
@ 2023-10-17 12:58                           ` Haitao Huang
  2023-10-17 18:54                             ` Michal Koutný
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-17 12:58 UTC (permalink / raw)
  To: Christopherson,, Sean, Huang, Kai
  Cc: Zhang, Bo, linux-sgx, cgroups, yangjie, dave.hansen, Li,
	Zhiquan1, linux-kernel, mingo, tglx, tj, anakrish, jarkko, hpa,
	mikko.ylinen, Mehta, Sohil, bp, x86, kristen

On Mon, 16 Oct 2023 20:34:57 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
>> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>> [...]
>>
>> > still need to fix the bug mentioned above here.
>> >
>> > I really think you should just go this simple way:
>> >
>> > When you want to take EPC back from VM, kill the VM.
>> >
>>
>> My only concern is that this is a compromise due to current limitation  
>> (no
>> other sane way to take EPC from VMs). If we define this behavior and it
>> becomes a contract to user space, then we can't change in future.
>
> Why do we need to "define such behaviour"?
>
> This isn't some kinda of kernel/userspace ABI IMHO, but only kernel  
> internal
> implementation.  Here VM is similar to normal host enclaves.  You limit  
> the
> resource, some host enclaves could be killed.  Similarly, VM could also  
> be
> killed too.
>
> And supporting VMM EPC oversubscription doesn't mean VM won't be  
> killed.  The VM
> can still be a target to kill after VM's all EPC pages have been swapped  
> out.
>
>>
>> On the other hand, my understanding the reason you want this behavior is
>> to enforce EPC limit at runtime.
>
> No I just thought this is a bug/issue needs to be fixed.  If anyone  
> believes
> this is not a bug/issue then it's a separate discussion.
>

AFAIK, before we introduced the max_write() callback in this series, no
misc controller could possibly enforce the limit when misc.max is reduced.
E.g., I don't think CVMs would be killed when the ASID limit is reduced
and the cgroup was full before the limit was reduced.

I think EPC pages given to VMs could have the same behavior: once they
are given to a guest, they are never taken back by the host. For enclaves
on the host side, pages are reclaimable, which allows us to enforce limits
in a way similar to memcg.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17  0:09                       ` Haitao Huang
@ 2023-10-17 15:43                         ` Sean Christopherson
  0 siblings, 0 replies; 144+ messages in thread
From: Sean Christopherson @ 2023-10-17 15:43 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Kai Huang, Bo Zhang, linux-sgx, cgroups, yangjie, dave.hansen,
	Zhiquan1 Li, linux-kernel, mingo, tglx, tj, anakrish, jarkko,
	hpa, mikko.ylinen, Sohil Mehta, bp, x86, kristen

On Mon, Oct 16, 2023, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson <seanjc@google.com>
> wrote:
> 
> > On Mon, Oct 16, 2023, Haitao Huang wrote:
> > > From this perspective, I think the current implementation is
> > > "well-defined":
> > > EPC cgroup limits for VMs are only enforced at VM launch time, not
> > > runtime.  In practice,  SGX VM can be launched only with fixed EPC size
> > > and all those EPCs are fully committed to the VM once launched.
> > 
> > Fully committed doesn't mean those numbers are reflected in the cgroup.  A
> > VM scheduler can easily "commit" EPC to a guest, but allocate EPC on
> > demand, i.e.  when the guest attempts to actually access a page.
> > Preallocating memory isn't free, e.g. it can slow down guest boot, so it's
> > entirely reasonable to have virtual EPC be allocated on-demand.  Enforcing
> > at launch time doesn't work for such setups, because from the cgroup's
> > perspective, the VM is using 0 pages of EPC at launch.
> > 
> Maybe I understood the current implementation wrong. From what I see, vEPC
> is impossible not fully commit at launch time. The guest would EREMOVE all
> pages during initialization resulting #PF and all pages allocated. This
> essentially makes "prealloc=off" the same as "prealloc=on".
> Unless you are talking about some custom OS or kernel other than upstream
> Linux here?

Yes, a customer could be running an older kernel, something other than Linux, a
custom kernel, an out-of-tree SGX driver, etc.  The host should never assume
anything about the guest kernel when it comes to correctness (unless the guest
kernel is controlled by the host).

Doing EREMOVE on all pages is definitely not mandatory, especially if the kernel
detects a hypervisor, i.e. knows its running as a guest.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17 12:58                           ` Haitao Huang
@ 2023-10-17 18:54                             ` Michal Koutný
  2023-10-17 19:13                               ` Michal Koutný
  2023-10-18  4:37                               ` Haitao Huang
  0 siblings, 2 replies; 144+ messages in thread
From: Michal Koutný @ 2023-10-17 18:54 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

[-- Attachment #1: Type: text/plain, Size: 1946 bytes --]

Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> AFAIK, before we introducing max_write() callback in this series, no misc
> controller would possibly enforce the limit when misc.max is reduced. e.g. I
> don't think CVMs be killed when ASID limit is reduced and the cgroup was
> full before limit is reduced.

Yes, the misc controller was meant to be simple; current >= max serves to
prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages, but it turned out to be too coarse (OOM-killing
processes if the amount was not sensed correctly), and this eventually
evolved into the specific mechanism of memory.reclaim.
I'm mentioning this in case that would be an interface with better
semantics for your use case (misc.max writes could then remain
non-preemptive).

One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then I found no such measure in the
code implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better-suited name
  than the OOM (out of memory) condition,
- killing -- (unless you have an intention to implement process
  termination later) my current interpretation is that it is rather some
  aggressive unmapping within the address space, so a less confusing name
  for that would be "reclaim".


> I think EPC pages to VMs could have the same behavior, once they are given
> to a guest, never taken back by the host. For enclaves on host side, pages
> are reclaimable, that allows us to enforce in a similar way to memcg.

Is this distinction between the preemptability of EPC pages mandated by
the HW implementation (host/"process" enclaves vs VM enclaves)? Or do
users have an option to lock certain pages in memory that yields this
difference?

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-23  3:06   ` Haitao Huang
                     ` (7 preceding siblings ...)
  (?)
@ 2023-10-17 18:54   ` Michal Koutný
  2023-10-19 16:05     ` Haitao Huang
  -1 siblings, 1 reply; 144+ messages in thread
From: Michal Koutný @ 2023-10-17 18:54 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc,
	zhanb, anakrish, mikko.ylinen, yangjie

[-- Attachment #1: Type: text/plain, Size: 908 bytes --]

On Fri, Sep 22, 2023 at 08:06:55PM -0700, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);

It should check for !epc_cg since the misc controller implementation
in misc_cg_alloc() would roll back even on non-allocated resources.

> +	cancel_work_sync(&epc_cg->reclaim_work);
> +	kfree(epc_cg);
> +}
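
I.e., something like this (sketch of the fixed version):

	static void sgx_epc_cgroup_free(struct misc_cg *cg)
	{
		struct sgx_epc_cgroup *epc_cg;

		epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
		if (!epc_cg)
			return;

		cancel_work_sync(&epc_cg->reclaim_work);
		kfree(epc_cg);
	}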
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +	/* Let the reclaimer to do the work so user is not blocked */
> +	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);

This is weird. The writer will never learn about the result of the
operation.
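
One way to report it (a sketch only; this assumes the callback is changed
to return an int that the misc core propagates to the writer, and the
usage helper is hypothetical):

	static int sgx_epc_cgroup_max_write(struct misc_cg *cg)
	{
		struct sgx_epc_cgroup *epc_cg = sgx_epc_cgroup_from_misc_cg(cg);

		queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
		flush_work(&epc_cg->reclaim_work);	/* wait for the reclaimer */

		/* tell the writer whether the new limit was actually met */
		if (sgx_epc_cgroup_usage(epc_cg) >
		    READ_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max))
			return -EBUSY;
		return 0;
	}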

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-23  3:06   ` Haitao Huang
                     ` (2 preceding siblings ...)
  (?)
@ 2023-10-17 18:55   ` Michal Koutný
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Koutný @ 2023-10-17 18:55 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc,
	zhanb, anakrish, mikko.ylinen, yangjie

[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

On Fri, Sep 22, 2023 at 08:06:40PM -0700, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>  
>  	cg = css_misc(of_css(of));
>  
> -	if (READ_ONCE(misc_res_capacity[type]))
> +	if (READ_ONCE(misc_res_capacity[type])) {
>  		WRITE_ONCE(cg->res[type].max, max);
> -	else
> +		if (cg->res[type].max_write)
> +			cg->res[type].max_write(cg);
> +	} else {
>  		ret = -EINVAL;
>
> +	}

Is it time for a misc_cg_mutex? This gives no synchronization guarantees
to implementors of max_write. (Alternatively, document that the callback
must implement its own synchronization.)
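
E.g., with a controller-wide lock (sketch of the first option; the mutex
is hypothetical):

	static DEFINE_MUTEX(misc_cg_mutex);

	...
	if (READ_ONCE(misc_res_capacity[type])) {
		mutex_lock(&misc_cg_mutex);
		WRITE_ONCE(cg->res[type].max, max);
		if (cg->res[type].max_write)
			cg->res[type].max_write(cg);
		mutex_unlock(&misc_cg_mutex);
	} else {
		ret = -EINVAL;
	}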


Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17 18:54                             ` Michal Koutný
@ 2023-10-17 19:13                               ` Michal Koutný
  2023-10-18  4:39                                 ` Haitao Huang
  2023-10-18  4:37                               ` Haitao Huang
  1 sibling, 1 reply; 144+ messages in thread
From: Michal Koutný @ 2023-10-17 19:13 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

[-- Attachment #1: Type: text/plain, Size: 509 bytes --]

On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types also warrant two types of misc resource? (To
deal with each in its own way.)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17 18:54                             ` Michal Koutný
  2023-10-17 19:13                               ` Michal Koutný
@ 2023-10-18  4:37                               ` Haitao Huang
  2023-10-18 13:55                                 ` Dave Hansen
  1 sibling, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-18  4:37 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

Hi Michal,

On Tue, 17 Oct 2023 13:54:46 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> Hello Haitao.
>
> On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang  
> <haitao.huang@linux.intel.com> wrote:
>> AFAIK, before we introducing max_write() callback in this series, no  
>> misc
>> controller would possibly enforce the limit when misc.max is reduced.  
>> e.g. I
>> don't think CVMs be killed when ASID limit is reduced and the cgroup was
>> full before limit is reduced.
>
> Yes, misccontroller was meant to be simple, current >= max serves to
> prevent new allocations.
>
Thanks for confirming. Maybe another alternative is to just keep
max_write non-preemptive, with no need to add the max_write() callback.

The EPC controller would then only trigger reclaiming on new allocations,
returning -ENOMEM if there is nothing more to reclaim (rough sketch
below). Reclaiming here includes normal EPC page reclaiming and killing
enclaves in out-of-EPC cases. vEPC assigned to guests is basically carved
out and never reclaimable by the host.
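
In other words, enforcement would live only in the allocation path,
roughly (misc_cg_try_charge() is the existing misc API; the reclaim/kill
helpers and field names are illustrative):

	static int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
	{
		for (;;) {
			if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC,
						epc_cg->cg, PAGE_SIZE))
				return 0;	/* under the limit */

			/* over the limit: reclaim first, then kill */
			if (sgx_epc_cgroup_reclaim_pages(epc_cg))
				continue;
			if (!sgx_epc_oom(&epc_cg->lru))
				return -ENOMEM;	/* nothing left to take */
		}
	}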

As we would no longer enforce limits when max_write lowers the value,
users should not expect the cgroup to forcibly reclaim pages from enclaves
or kill VMs/enclaves as a result of reducing limits 'in place'. Users
should always create cgroups, set limits, and then launch enclaves/VMs
into the groups created.

> FTR, at some point in time memory.max was considered for reclaim control
> of regular pages but it turned out to be too coarse (and OOM killing
> processes if amount was not sensed correctly) and this eventually
> evolved into specific mechanism of memory.reclaim.
> So I'm mentioning this should that be an interface with better semantic
> for your use case (and misc.max writes can remain non-preemptive).
>

Yes, we can introduce misc.reclaim to give the user a knob to forcefully
reduce usage if that is really needed in practice. The semantics would
make force-killing VMs explicit to the user (hypothetical sketch below).
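
An entirely hypothetical sketch of such a knob, mirroring memory.reclaim
(none of this exists in the misc controller today, including the
per-resource ->reclaim callback):

	static ssize_t misc_cg_reclaim_write(struct kernfs_open_file *of,
					     char *buf, size_t nbytes,
					     loff_t off)
	{
		struct misc_cg *cg = css_misc(of_css(of));
		u64 amount;
		int ret;

		ret = kstrtou64(strstrip(buf), 0, &amount);
		if (ret)
			return ret;

		/*
		 * Ask the resource owner (EPC here) to shed 'amount';
		 * the callback decides whether that means reclaiming
		 * enclave pages or killing VMs, and reports failure
		 * explicitly to the writer.
		 */
		ret = cg->res[MISC_CG_RES_SGX_EPC].reclaim(cg, amount);
		return ret ? ret : nbytes;
	}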

> One more note -- I was quite confused when I read in the rest of the
> series about OOM and _kill_ing but then I found no such measure in the
> code implementation. So I would suggest two terminological changes:
>
> - the basic premise of the series (00/18) is that EPC pages are a
>   different resource than memory, hence choose a better suiting name
>   than OOM (out of memory) condition,

I couldn't come up with a good name. Out of EPC (OOEPC), maybe? I feel
OOEPC would be hard to read in code, though. OOM was relatable as it is
similar to normal OOM, just for a special kind of memory :-) I'm open to
any better suggestions.

> - killing -- (unless you have an intention to implement process
>   termination later) My current interpretation that it is rather some
>   aggressive unmapping within address space, so less confusing name for
>   that would be "reclaim".
>

Yes. Killing here refers to killing an enclave, analogous to killing a
process, not just 'reclaim', though. I can change it to always use
'killing enclave' explicitly.

>
>> I think EPC pages to VMs could have the same behavior, once they are  
>> given
>> to a guest, never taken back by the host. For enclaves on host side,  
>> pages
>> are reclaimable, that allows us to enforce in a similar way to memcg.
>
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?
>

The difference is really a result of the current vEPC implementation.
Because enclave pages, once in use, contain confidential content, they
need a special process to be reclaimed. So it's complex to implement
graceful host reclaim of guest EPC.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-17 19:13                               ` Michal Koutný
@ 2023-10-18  4:39                                 ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-18  4:39 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

On Tue, 17 Oct 2023 14:13:22 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný  
> <mkoutny@suse.com> wrote:
>> Is this distinction between preemptability of EPC pages mandated by the
>> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
>> users an option to lock certain pages in memory that yields this
>> difference?
>
> (After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")
>
> Or would these two types warrant also two types of miscresource? (To
> deal with each in own way.)

They come from the same bucket of HW resource, so I think it's more
suitable to keep them as one resource type. Otherwise we'd need a policy
for dividing the capacity, etc. And it is still possible that vEPC becomes
reclaimable in the future.

My current thinking is we probably can get away with non-preemptive  
max_write for enclaves too. See my other reply.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-18  4:37                               ` Haitao Huang
@ 2023-10-18 13:55                                 ` Dave Hansen
  2023-10-18 15:26                                   ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-10-18 13:55 UTC (permalink / raw)
  To: Haitao Huang, Michal Koutný
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

On 10/17/23 21:37, Haitao Huang wrote:
> Yes we can introduce misc.reclaim to give user a knob to forcefully 
> reducing usage if that is really needed in real usage. The semantics
> would make force-kill VMs explicit to user.

Do any other controllers do something like this?  It seems odd.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-18 13:55                                 ` Dave Hansen
@ 2023-10-18 15:26                                   ` Haitao Huang
  2023-10-18 15:37                                     ` Dave Hansen
  0 siblings, 1 reply; 144+ messages in thread
From: Haitao Huang @ 2023-10-18 15:26 UTC (permalink / raw)
  To: Michal Koutný, Dave Hansen
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

On Wed, 18 Oct 2023 08:55:12 -0500, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 10/17/23 21:37, Haitao Huang wrote:
>> Yes, we can introduce misc.reclaim to give the user a knob to forcefully
>> reduce usage if that is really needed in practice. The semantics would
>> make force-killing VMs explicit to the user.
>
> Do any other controllers do something like this?  It seems odd.

Maybe not in the sense of killing something. My understanding is that
memory.reclaim does not necessarily invoke the OOM killer. But what I
really intend to say is that we can have a separate knob for the user to
explicitly express the need to reduce the current usage, while keeping the
non-preemptive semantics of "misc.max" intact. When we implement that new
knob, we can then define what kind of reclaim it performs. Depending on the
vEPC implementation, it may or may not involve killing VMs. But at least
the semantics will be explicit for the user.
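
As a rough sketch of the analogy: cgroup v2's memory.reclaim is an existing
knob, while the misc.reclaim counterpart for EPC suggested above is
hypothetical here:

  # memory.reclaim exists today: ask the kernel to reclaim ~1G from this
  # cgroup; it does not necessarily invoke the OOM killer.
  echo 1G > /sys/fs/cgroup/mygroup/memory.reclaim

  # Hypothetical EPC counterpart (not implemented): explicitly request a
  # usage reduction; what "reclaim" means here, including whether VMs get
  # killed, would be defined when the knob is added.
  echo "sgx_epc 134217728" > /sys/fs/cgroup/mygroup/misc.reclaim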

Thanks
Haitao


* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-18 15:26                                   ` Haitao Huang
@ 2023-10-18 15:37                                     ` Dave Hansen
  2023-10-18 15:52                                       ` Michal Koutný
  0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-10-18 15:37 UTC (permalink / raw)
  To: Haitao Huang, Michal Koutný
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

On 10/18/23 08:26, Haitao Huang wrote:
> Maybe not in the sense of killing something. My understanding is that
> memory.reclaim does not necessarily invoke the OOM killer. But what I
> really intend to say is that we can have a separate knob for the user to
> explicitly express the need to reduce the current usage, while keeping
> the non-preemptive semantics of "misc.max" intact. When we implement
> that new knob, we can then define what kind of reclaim it performs.
> Depending on the vEPC implementation, it may or may not involve killing
> VMs. But at least the semantics will be explicit for the user.

I'm really worried that you're going for "perfect" semantics here.  This
is SGX.  It's *NOT* getting heavy use, and even fewer folks will ever
apply cgroup controls to it.

Can we please stick to simple, easily-coded rules here?  I honestly
don't think these corner cases matter all that much and there's been
*WAY* too much traffic in this thread for what is ultimately not that
complicated.  Focus on *ONE* thing:

1. Admin sets a limit
2. Enclave is created
3. Enclave hits limit, allocation fails

Nothing else matters.  What if the admin lowers the limit on an
already-created enclave?  Nobody cares.  Seriously.  What about inducing
reclaim?  Nobody cares.  What about vEPC?  Doesn't matter, an enclave
page is an enclave page.
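
For concreteness, that flow maps onto the standard misc-controller
interface roughly as below (a sketch, assuming the "sgx_epc" resource name
used by this series and that the misc controller is enabled in the
parent's cgroup.subtree_control):

  # 1. Admin sets a limit (in bytes of EPC) for a cgroup.
  mkdir /sys/fs/cgroup/enclave_pod
  echo "sgx_epc 67108864" > /sys/fs/cgroup/enclave_pod/misc.max

  # 2. Enclave is created by a process running in that cgroup.
  echo $$ > /sys/fs/cgroup/enclave_pod/cgroup.procs

  # 3. Enclave hits the limit: further EPC allocations fail; usage and
  #    limit-hit events are visible in the standard files.
  cat /sys/fs/cgroup/enclave_pod/misc.current
  cat /sys/fs/cgroup/enclave_pod/misc.events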


* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-18 15:37                                     ` Dave Hansen
@ 2023-10-18 15:52                                       ` Michal Koutný
  2023-10-18 16:25                                         ` Haitao Huang
  0 siblings, 1 reply; 144+ messages in thread
From: Michal Koutný @ 2023-10-18 15:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Haitao Huang, Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen


On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen <dave.hansen@intel.com> wrote:
> 1. Admin sets a limit
> 2. Enclave is created
> 3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series into a part
implementing this simple limiting and a subsequent part with the reclaim
stuff, for easier digestibility.

> Nothing else matters.

If the latter part is unnecessary overkill, even better.

Thanks,
Michal



* Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-10-18 15:52                                       ` Michal Koutný
@ 2023-10-18 16:25                                         ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-18 16:25 UTC (permalink / raw)
  To: Dave Hansen, Michal Koutný
  Cc: Christopherson,,
	Sean, Huang, Kai, Zhang, Bo, linux-sgx, cgroups, yangjie,
	dave.hansen, Li, Zhiquan1, linux-kernel, mingo, tglx, tj,
	anakrish, jarkko, hpa, mikko.ylinen, Mehta, Sohil, bp, x86,
	kristen

On Wed, 18 Oct 2023 10:52:23 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen  
> <dave.hansen@intel.com> wrote:
>> 1. Admin sets a limit
>> 2. Enclave is created
>> 3. Enclave hits limit, allocation fails
>
> I was actually about to suggest reorganizing the series into a part
> implementing this simple limiting and a subsequent part with the reclaim
> stuff, for easier digestibility.
>
>> Nothing else matters.
>
> If the latter part is unnecessary overkill, even better.
>

Ok. I'll take out the max_write() callback and only implement a
non-preemptive misc.max for EPC.
I can also separate out the out-of-EPC killing of enclaves, which is not
needed if we only block allocation at the limit; there is no need to kill
one enclave to make space for another. This will simplify things a lot.
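
A minimal sketch of that non-preemptive behavior at the allocation site
(names follow this series' misc-controller patches; how the current task's
misc cgroup is looked up is elided):

/*
 * Sketch, not the actual patch: charge the cgroup before allocating an
 * EPC page. With non-preemptive misc.max semantics, hitting the limit
 * simply fails the allocation -- nothing is reclaimed or killed here.
 */
static int sgx_epc_try_charge(struct misc_cg *cg)
{
	/* misc_cg_try_charge() returns non-zero when over misc.max. */
	if (misc_cg_try_charge(MISC_CG_RES_SGX_EPC, cg, PAGE_SIZE))
		return -ENOMEM;	/* at the limit: fail, full stop */

	return 0;
}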

Thanks to all for your input!

Haitao
 


* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-10-17 18:54   ` Michal Koutný
@ 2023-10-19 16:05     ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-19 16:05 UTC (permalink / raw)
  To: Michal Koutný
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups,
	tglx, mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc,
	zhanb, anakrish, mikko.ylinen, yangjie

On Tue, 17 Oct 2023 13:54:54 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Fri, Sep 22, 2023 at 08:06:55PM -0700, Haitao Huang  
> <haitao.huang@linux.intel.com> wrote:
>> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
>> +{
>> +	struct sgx_epc_cgroup *epc_cg;
>> +
>> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
>
> It should check for !epc_cg since the misc controller implementation
> in misc_cg_alloc() would roll back even on non-allocated resources.

Good catch. Will fix.
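
For reference, a sketch of the fixed callback based on the quoted code:

/*
 * Sketch of the fix: misc_cg_alloc() may roll back and call ->free()
 * for resources whose ->alloc() never ran, so epc_cg can be NULL here.
 */
static void sgx_epc_cgroup_free(struct misc_cg *cg)
{
	struct sgx_epc_cgroup *epc_cg;

	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
	if (!epc_cg)
		return;

	cancel_work_sync(&epc_cg->reclaim_work);
	kfree(epc_cg);
}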

>
>> +	cancel_work_sync(&epc_cg->reclaim_work);
>> +	kfree(epc_cg);
>> +}
>> +
>> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
>> +{
>> +	struct sgx_epc_reclaim_control rc;
>> +	struct sgx_epc_cgroup *epc_cg;
>> +
>> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
>> +
>> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
>> +	/* Let the reclaimer to do the work so user is not blocked */
>> +	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
>
> This is weird. The writer will never learn about the result of the
> operation.
>
Right. With the new plan, this callback will be removed.

Thanks
Haitao


* Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-10-10  0:26   ` Huang, Kai
@ 2023-10-22 18:26     ` Haitao Huang
  0 siblings, 0 replies; 144+ messages in thread
From: Haitao Huang @ 2023-10-22 18:26 UTC (permalink / raw)
  To: hpa, linux-sgx, x86, dave.hansen, cgroups, bp, linux-kernel,
	jarkko, tglx, Mehta, Sohil, tj, mingo, Huang, Kai
  Cc: kristen, yangjie, Li, Zhiquan1, Christopherson,,
	Sean, mikko.ylinen, Zhang, Bo, anakrish

On Mon, 09 Oct 2023 19:26:01 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>> @@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists  
>> *lru, size_t nr_to_scan,
>>   * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>>   * @ignore_age:		 Reclaim a page even if it is young
>> + * @epc_cg:		 EPC cgroup from which to reclaim
>>   *
>>   * Take a fixed number of pages from the head of the active page pool  
>> and
>>   * reclaim them to the enclave's private shmem files. Skip the pages,  
>> which have
>> @@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists  
>> *lru, size_t nr_to_scan,
>>   * problematic as it would increase the lock contention too much,  
>> which would
>>   * halt forward progress.
>>   */
>> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
>> +			     struct sgx_epc_cgroup *epc_cg)
>>  {
>>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>>  	struct sgx_epc_page *epc_page, *tmp;
>> @@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan,  
>> bool ignore_age)
>>  	LIST_HEAD(iso);
>>  	size_t ret, i;
>>
>> -	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
>> +	/*
>> +	 * If a specific cgroup is not being targeted, take from the global
>> +	 * list first, even when cgroups are enabled.  If there are
>> +	 * pages on the global LRU then they should get reclaimed asap.
>> +	 */

These are probably some obsolete comments I should have removed. When the
cgroup is enabled, reclaimable pages will always be in a cgroup, the root
by default. The (!epc_cg) condition is harmless but not needed, because
the global list will be empty if the cgroup is enabled.

>> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
>> +		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
>> +
>> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
>

So it should have been:

+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+	else
+		sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);

Or just encapsulate the difference in sgx_epc_cgroup_isolate_pages.
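
A sketch of that encapsulation option, using only names from the quoted
hunks (the helper name itself is made up):

/*
 * Hypothetical helper: hide the "global LRU vs. cgroup" decision so
 * sgx_reclaim_epc_pages() makes a single isolate call. When the cgroup
 * controller is built in, every reclaimable page is in some cgroup (the
 * root by default), so the global list would be empty anyway.
 */
static void sgx_isolate_pages(struct sgx_epc_cgroup *epc_cg,
			      size_t *nr_to_scan, struct list_head *iso)
{
	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
		sgx_epc_cgroup_isolate_pages(epc_cg, nr_to_scan, iso);
	else
		sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, iso);
}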

> (I wish such code can be somehow moved to the earlier patches, so that  
> we can
> get early idea that how sgx_reclaim_epc_pages() is supposed to be used.)
>

I will try to restructure and split this patch. Now that we are not going
to deal with unreclaimable pages, it'd be simpler and also easier to
restructure.

> So here when we are not targeting a specific EPC cgroup, we always  
> reclaim from
> the global list first, ...
>
> [...]
>
>>
>>  	if (list_empty(&iso))
>>  		return 0;
>> @@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long  
>> watermark)
>>  void sgx_reclaim_direct(void)
>>  {
>>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>
> ... and we always try to reclaim the global list first when directly  
> reclaim is
> desired, even the enclave is within some EPC cgroup.  ...
>
>>  }
>>
>>  static int ksgxd(void *p)
>> @@ -446,7 +460,7 @@ static int ksgxd(void *p)
>>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>>
>>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
>> -			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>
> ... and in ksgxd() as well, which I guess is somehow acceptable.  ...
>
>>
>>  		cond_resched();
>>  	}
>> @@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>>  {
>>  	struct sgx_epc_page *page;
>> +	struct sgx_epc_cgroup *epc_cg;
>> +
>> +	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
>> +	if (IS_ERR(epc_cg))
>> +		return ERR_CAST(epc_cg);

I think I need to add comments to clarify that, after this point, reclaim
is done by the global reclaimer only, to keep the global free-page
watermark satisfied. So all reclaiming is from the root if the cgroup is
enabled, otherwise from the global LRU (no change from the current
implementation).

>>
>>  	for ( ; ; ) {
>>  		page = __sgx_alloc_epc_page();
>> @@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
>> *owner, bool reclaim)
>>  			break;
>>  		}
>>
>> -		if (!sgx_can_reclaim())
>> -			return ERR_PTR(-ENOMEM);
>> +		if (!sgx_can_reclaim()) {
>> +			page = ERR_PTR(-ENOMEM);
>> +			break;
>> +		}
>>
>>  		if (!reclaim) {
>>  			page = ERR_PTR(-EBUSY);
>> @@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
>> *owner, bool reclaim)
>>  			break;
>>  		}
>>
>> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>
> ... and when an EPC page is allocated, no matter whether the EPC page  
> belongs to
> any cgroup or not.
>
> When we are allocating EPC page for one enclave, if that enclave belongs  
> to some
> cgroup, is it more reasonable to reclaim EPC pages from it's own group  
> (and the
> children under it)?
>
> You already got the current EPC cgroup at the beginning of  
> sgx_alloc_epc_page()
> when you want to charge the EPC allocation.
>
>>  		cond_resched();
>>  	}
>>

I hope the above comments make it clear that all these calls to
sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL) reclaim from the global
list if the cgroup is not enabled, or from the root if the cgroup is
enabled.

Thanks
Haitao



Thread overview: 144+ messages
2023-09-23  3:06 [PATCH v5 00/18] Add Cgroup support for SGX EPC memory Haitao Huang
2023-09-23  3:06 ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-25 17:09   ` Jarkko Sakkinen
2023-09-25 17:09     ` Jarkko Sakkinen
2023-09-26  3:04     ` Haitao Huang
2023-09-26 13:10       ` Jarkko Sakkinen
2023-09-26 13:10         ` Jarkko Sakkinen
2023-09-26 13:13         ` Jarkko Sakkinen
2023-09-26 13:13           ` Jarkko Sakkinen
2023-09-27  1:56           ` Haitao Huang
2023-10-02 22:47             ` Jarkko Sakkinen
2023-10-02 22:55               ` Jarkko Sakkinen
2023-10-04 15:45                 ` Haitao Huang
2023-10-04 17:18                   ` Tejun Heo
2023-09-27  9:20   ` Huang, Kai
2023-10-03 14:29     ` Haitao Huang
2023-10-17 18:55   ` Michal Koutný
2023-09-23  3:06 ` [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-25 18:50   ` Tejun Heo
2023-09-25 18:50     ` Tejun Heo
2023-09-28  3:59   ` Huang, Kai
2023-10-03  7:00     ` Haitao Huang
2023-10-03 19:33       ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists Haitao Huang
2023-09-23  3:06 ` [PATCH v5 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-27 10:14   ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 06/18] x86/sgx: Introduce EPC page states Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-25 17:11   ` Jarkko Sakkinen
2023-09-25 17:11     ` Jarkko Sakkinen
2023-09-27 10:28   ` Huang, Kai
2023-10-03  4:49     ` Haitao Huang
2023-10-03 20:03       ` Huang, Kai
2023-10-04 15:24         ` Haitao Huang
2023-10-04 21:05           ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-25 17:13   ` Jarkko Sakkinen
2023-09-25 17:13     ` Jarkko Sakkinen
2023-09-27 10:42   ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 08/18] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-28  9:28   ` Huang, Kai
2023-10-03  5:09     ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-27 11:14   ` Huang, Kai
2023-09-27 15:35     ` Haitao Huang
2023-09-27 21:21       ` Huang, Kai
2023-09-29 15:06         ` Haitao Huang
2023-10-02 11:05           ` Huang, Kai
2023-09-27 11:35   ` Huang, Kai
2023-10-03  6:45     ` Haitao Huang
2023-10-03 20:07       ` Huang, Kai
2023-10-04 15:03         ` Haitao Huang
2023-10-04 21:13           ` Huang, Kai
2023-10-05  4:22             ` Haitao Huang
2023-10-05  6:49               ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 10/18] x86/sgx: Add EPC page flags to identify owner types Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 11/18] x86/sgx: store unreclaimable pages in LRU lists Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-27 11:57   ` Huang, Kai
2023-10-03  5:42     ` Haitao Huang
2023-09-28  9:41   ` Huang, Kai
2023-10-03  5:15     ` Haitao Huang
2023-10-03 20:12       ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-10-09 23:45   ` Huang, Kai
2023-10-10  0:23     ` Sean Christopherson
2023-10-10  0:50       ` Huang, Kai
2023-10-10  1:34         ` Huang, Kai
2023-10-10 16:49           ` Haitao Huang
2023-10-11  0:51             ` Huang, Kai
2023-10-12 13:27               ` Haitao Huang
2023-10-16 10:57                 ` Huang, Kai
2023-10-16 19:52                   ` Haitao Huang
2023-10-16 21:09                     ` Huang, Kai
2023-10-17  0:10                       ` Haitao Huang
2023-10-17  1:34                         ` Huang, Kai
2023-10-17 12:58                           ` Haitao Huang
2023-10-17 18:54                             ` Michal Koutný
2023-10-17 19:13                               ` Michal Koutný
2023-10-18  4:39                                 ` Haitao Huang
2023-10-18  4:37                               ` Haitao Huang
2023-10-18 13:55                                 ` Dave Hansen
2023-10-18 15:26                                   ` Haitao Huang
2023-10-18 15:37                                     ` Dave Hansen
2023-10-18 15:52                                       ` Michal Koutný
2023-10-18 16:25                                         ` Haitao Huang
2023-10-16 21:32                     ` Sean Christopherson
2023-10-17  0:09                       ` Haitao Huang
2023-10-17 15:43                         ` Sean Christopherson
2023-10-17 11:49                       ` Mikko Ylinen
2023-10-11  1:14             ` Huang, Kai
2023-10-16 11:02               ` Huang, Kai
2023-10-10  1:42       ` Haitao Huang
2023-10-10  2:23         ` Huang, Kai
2023-10-10 13:26           ` Haitao Huang
2023-10-11  0:01             ` Sean Christopherson
2023-10-11 15:02               ` Haitao Huang
2023-10-10  1:04     ` Haitao Huang
2023-10-10  1:18       ` Huang, Kai
2023-10-10  1:38         ` Haitao Huang
2023-10-10  2:12           ` Huang, Kai
2023-10-10 17:05             ` Haitao Huang
2023-10-11  0:31               ` Huang, Kai
2023-10-11 16:04                 ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-10-05 12:24   ` Huang, Kai
2023-10-05 19:23     ` Haitao Huang
2023-10-05 20:25       ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-10-05 12:30   ` Huang, Kai
2023-10-05 19:33     ` Haitao Huang
2023-10-05 20:38       ` Huang, Kai
2023-09-23  3:06 ` [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-25 17:15   ` Jarkko Sakkinen
2023-09-25 17:15     ` Jarkko Sakkinen
2023-10-05 21:01   ` Huang, Kai
2023-10-10  0:12   ` Huang, Kai
2023-10-10  0:16   ` Huang, Kai
2023-10-10  0:26   ` Huang, Kai
2023-10-22 18:26     ` Haitao Huang
2023-10-10  9:19   ` Huang, Kai
2023-10-10  9:32   ` Huang, Kai
2023-10-17 18:54   ` Michal Koutný
2023-10-19 16:05     ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 17/18] Docs/x86/sgx: Add description for cgroup support Haitao Huang
2023-09-23  3:06   ` Haitao Huang
2023-09-23  3:06 ` [PATCH v5 18/18] selftests/sgx: Add scripts for EPC cgroup testing Haitao Huang
2023-09-23  3:06   ` Haitao Huang
