* [PATCH v2 00/18] Add Cgroup support for SGX EPC memory
@ 2022-12-02 18:36 Kristen Carlson Accardi
  2022-12-02 18:36 ` [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages() Kristen Carlson Accardi
                   ` (18 more replies)
  0 siblings, 19 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups; +Cc: zhiquan1.li

Utilize the Miscellaneous cgroup controller to regulate the
distribution of SGX EPC memory. EPC memory is a subset of system RAM
that is used to provide SGX-enabled applications with protected
memory, and is otherwise inaccessible to the rest of the system.

SGX EPC memory allocations are separate from normal RAM allocations
and are managed solely by the SGX subsystem. The existing cgroup
memory controller cannot be used to limit or account for SGX EPC
memory.

This patchset implements support for sgx_epc memory within the misc
cgroup controller, and then uses the misc cgroup controller to
provide support for setting the total system capacity, a max limit
per cgroup, and events.
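
For illustration, assuming the EPC resource is exposed through the
misc controller under the name "sgx_epc" (the name this series uses),
the standard misc interface files would be used roughly as follows.
This is a sketch of the intended usage, not output from a real
system:

  # Enable the misc controller for child cgroups.
  echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control

  # Total EPC capacity, reported in the root cgroup.
  cat /sys/fs/cgroup/misc.capacity

  # Limit the "sgx_test" cgroup to 64 MiB of EPC.
  mkdir /sys/fs/cgroup/sgx_test
  echo "sgx_epc 67108864" > /sys/fs/cgroup/sgx_test/misc.max

  # Current usage and limit events.
  cat /sys/fs/cgroup/sgx_test/misc.current
  cat /sys/fs/cgroup/sgx_test/misc.events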

This work was originally authored by Sean Christopherson a few years ago,
and was modified to work with more recent kernels, and to utilize the
misc cgroup controller rather than a custom controller. It is currently
based on top of the MCA patches.

Here's the MCA patchset for reference.
https://lore.kernel.org/linux-sgx/2d52c8c4-8ed0-6df2-2911-da5b9fcc9ae4@intel.com/T/#t

The patchset adds support for multiple LRUs to track both reclaimable
EPC pages (i.e. pages the reclaimer knows about) and unreclaimable
EPC pages (i.e. pages which the reclaimer isn't aware of, such as VA
pages). These pages are assigned to an LRU, as well as an enclave, so
that an enclave's full EPC usage can be tracked and limited to a max
value. During OOM events, an enclave can have its memory zapped, and
all the EPC pages not tracked by the reclaimer can be freed.
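
For reference, the per-page tracking API the series converges on
(patches 05-07) is used roughly like this; a sketch of typical call
sites, with the function and flag names taken from the patches below:

  /* An enclave page enters the reclaimable LRU: */
  sgx_record_epc_page(encl_page->epc_page,
		      SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED);

  /* A VA page is tracked, but on the unreclaimable LRU: */
  sgx_record_epc_page(epc_page, SGX_EPC_PAGE_VERSION_ARRAY);

  /* Dropping a page fails with -EBUSY while the reclaimer holds it: */
  if (sgx_drop_epc_page(entry->epc_page))
	  continue;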

I appreciate your comments and feedback.

Changelog:

v2:
 * Rename struct sgx_epc_lru to sgx_epc_lru_lists to make it clear
   that this struct contains 2 lists.
 * Use inline functions rather than macros for sgx_epc_page_list*
   wrappers.
 * Remove flags macros and open code all flags.
 * Improve the commit message for the RECLAIM_IN_PROGRESS patch to
   make it clearer what the patch does.
 * Remove notifier_block from the misc cgroup changes and use a set
   of ops for callbacks instead.
 * Rename root_misc to misc_cg_root and parent_misc to misc_cg_parent.
 * Consolidate the misc cgroup changes into 2 patches and remove most
   of the previous helper functions.

Kristen Carlson Accardi (7):
  x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  x86/sgx: Use sgx_epc_lru_lists for existing active page list
  x86/sgx: Track epc pages on reclaimable or unreclaimable lists
  cgroup/misc: Add per resource callbacks for css events
  cgroup/misc: Prepare for SGX usage
  x86/sgx: Add support for misc cgroup controller
  Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (11):
  x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  x86/sgx: Store struct sgx_encl when allocating new VA pages
  x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  x86/sgx: Return the number of EPC pages that were successfully
    reclaimed
  x86/sgx: Add option to ignore age of page during EPC reclaim
  x86/sgx: Prepare for multiple LRUs
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC

 Documentation/x86/sgx.rst            |  77 ++++
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/encl.c       |  90 ++++-
 arch/x86/kernel/cpu/sgx/encl.h       |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 539 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 +++
 arch/x86/kernel/cpu/sgx/ioctl.c      |  14 +-
 arch/x86/kernel/cpu/sgx/main.c       | 412 ++++++++++++++++----
 arch/x86/kernel/cpu/sgx/sgx.h        | 122 +++++-
 arch/x86/kernel/cpu/sgx/virt.c       |  28 +-
 include/linux/misc_cgroup.h          |  35 ++
 kernel/cgroup/misc.c                 |  76 +++-
 13 files changed, 1341 insertions(+), 129 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

-- 
2.38.1



* [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 21:33   ` Dave Hansen
  2022-12-02 18:36 ` [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Kristen Carlson Accardi
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

In order to avoid repeating cond_resched() in ksgxd() and
sgx_alloc_epc_page(), move the post-reclaim cond_resched() call
inside sgx_reclaim_pages(). Except in the case of
sgx_reclaim_direct(), sgx_reclaim_pages() is always called in a loop
and is always followed by a call to cond_resched().  This will hold
true for the EPC cgroup as well, which adds even more calls to
sgx_reclaim_pages() and thus cond_resched(). Calls to
sgx_reclaim_direct() may be performance sensitive, so allow
sgx_reclaim_direct() to skip the cond_resched() by renaming the
original function to __sgx_reclaim_pages() and making
sgx_reclaim_pages() a wrapper that calls it followed by
cond_resched().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 160c8dbee0ab..ffce6fc70a1f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+static void __sgx_reclaim_pages(void)
 {
 	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
 	}
 }
 
+static void sgx_reclaim_pages(void)
+{
+	__sgx_reclaim_pages();
+	cond_resched();
+}
+
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
@@ -378,12 +384,14 @@ static bool sgx_should_reclaim(unsigned long watermark)
 /*
  * sgx_reclaim_direct() should be called (without enclave's mutex held)
  * in locations where SGX memory resources might be low and might be
- * needed in order to make forward progress.
+ * needed in order to make forward progress. This call to
+ * __sgx_reclaim_pages() avoids the cond_resched() in sgx_reclaim_pages()
+ * to improve performance.
  */
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		__sgx_reclaim_pages();
 }
 
 static int ksgxd(void *p)
@@ -410,8 +418,6 @@ static int ksgxd(void *p)
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
 			sgx_reclaim_pages();
-
-		cond_resched();
 	}
 
 	return 0;
@@ -582,7 +588,6 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 		}
 
 		sgx_reclaim_pages();
-		cond_resched();
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-- 
2.38.1



* [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
  2022-12-02 18:36 ` [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages() Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 21:35   ` Dave Hansen
  2022-12-02 18:36 ` [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Kristen Carlson Accardi
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

When allocating new Version Array (VA) pages, pass the struct sgx_encl
of the enclave that is allocating the page. sgx_alloc_epc_page() will
store this pointer in the owner union of the struct sgx_epc_page (the
new 'encl' member). In a later patch, VA pages will be placed on an
unreclaimable list; then, when the cgroup max limit is reached, there
are no more reclaimable pages, and the enclave must be OOM killed,
all the VA pages associated with that enclave can be uncharged and
freed.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 5 +++--
 arch/x86/kernel/cpu/sgx/encl.h  | 2 +-
 arch/x86/kernel/cpu/sgx/ioctl.c | 2 +-
 arch/x86/kernel/cpu/sgx/sgx.h   | 1 +
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f40d64206ded..4eaf9d21e71b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1193,6 +1193,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @encl:    The enclave that this page is allocated to.
  * @reclaim: Reclaim EPC pages directly if none available. Enclave
  *           mutex should not be held if this is set.
  *
@@ -1202,12 +1203,12 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(NULL, reclaim);
+	epc_page = sgx_alloc_epc_page(encl, reclaim);
 	if (IS_ERR(epc_page))
 		return ERR_CAST(epc_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..831d63f80f5a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,7 +116,7 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index ebe79d60619f..9a1bb3c3211a 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -30,7 +30,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
 		if (!va_page)
 			return ERR_PTR(-ENOMEM);
 
-		va_page->epc_page = sgx_alloc_va_page(reclaim);
+		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
 		if (IS_ERR(va_page->epc_page)) {
 			err = ERR_CAST(va_page->epc_page);
 			kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d16a8baa28d4..39cb15a8abcb 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,6 +39,7 @@ struct sgx_epc_page {
 		struct sgx_encl_page *encl_owner;
 		/* Use when SGX_EPC_PAGE_KVM_GUEST set in ->flags: */
 		void __user *vepc_vaddr;
+		struct sgx_encl *encl;
 	};
 	struct list_head list;
 };
-- 
2.38.1



* [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
  2022-12-02 18:36 ` [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages() Kristen Carlson Accardi
  2022-12-02 18:36 ` [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 21:39   ` Dave Hansen
  2022-12-08 15:31   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Kristen Carlson Accardi
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

Introduce a struct that wraps the existing reclaimable list and its
spinlock. This minimizes the code changes needed to handle multiple
LRUs as well as reclaimable and non-reclaimable lists, both of which
will be introduced and used by SGX EPC cgroups.
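
A minimal usage sketch of the helpers added below; callers are
expected to hold the embedded lock around the push/pop wrappers:

  /* 'page' here is some previously allocated EPC page. */
  struct sgx_epc_lru_lists lrus;

  sgx_lru_init(&lrus);

  spin_lock(&lrus.lock);
  sgx_epc_push_reclaimable(&lrus, page);  /* enqueue at the tail */
  page = sgx_epc_pop_reclaimable(&lrus);  /* dequeue from head, NULL if empty */
  spin_unlock(&lrus.lock);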

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/sgx.h | 65 +++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 39cb15a8abcb..5e6d88438fae 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -90,6 +90,71 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * This data structure wraps a list of reclaimable EPC pages, and a list of
+ * non-reclaimable EPC pages and is used to implement a LRU policy during
+ * reclamation.
+ */
+struct sgx_epc_lru_lists {
+	spinlock_t lock;
+	struct list_head reclaimable;
+	struct list_head unreclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
+{
+	spin_lock_init(&lrus->lock);
+	INIT_LIST_HEAD(&lrus->reclaimable);
+	INIT_LIST_HEAD(&lrus->unreclaimable);
+}
+
+/*
+ * Must be called with queue lock acquired
+ */
+static inline void __sgx_epc_page_list_push(struct list_head *list, struct sgx_epc_page *page)
+{
+	list_add_tail(&page->list, list);
+}
+
+/*
+ * Must be called with queue lock acquired
+ */
+static inline struct sgx_epc_page * __sgx_epc_page_list_pop(struct list_head *list)
+{
+	struct sgx_epc_page *epc_page;
+
+	if (list_empty(list))
+		return NULL;
+
+	epc_page = list_first_entry(list, struct sgx_epc_page, list);
+	list_del_init(&epc_page->list);
+	return epc_page;
+}
+
+static inline struct sgx_epc_page *
+sgx_epc_pop_reclaimable(struct sgx_epc_lru_lists *lrus)
+{
+	return __sgx_epc_page_list_pop(&(lrus)->reclaimable);
+}
+
+static inline void sgx_epc_push_reclaimable(struct sgx_epc_lru_lists *lrus,
+					    struct sgx_epc_page *page)
+{
+	__sgx_epc_page_list_push(&(lrus)->reclaimable, page);
+}
+
+static inline struct sgx_epc_page *
+sgx_epc_pop_unreclaimable(struct sgx_epc_lru_lists *lrus)
+{
+	return __sgx_epc_page_list_pop(&(lrus)->unreclaimable);
+}
+
+static inline void sgx_epc_push_unreclaimable(struct sgx_epc_lru_lists *lrus,
+					      struct sgx_epc_page *page)
+{
+	__sgx_epc_page_list_push(&(lrus)->unreclaimable, page);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.38.1



* [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (2 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 21:43   ` Dave Hansen
  2022-12-02 18:36 ` [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists Kristen Carlson Accardi
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

Replace the existing sgx_active_page_list and its spinlock with
a global sgx_epc_lru_lists struct.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ffce6fc70a1f..447cf4b8580c 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -26,10 +26,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_lists sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -298,14 +297,12 @@ static void __sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = sgx_epc_pop_reclaimable(&sgx_global_lru);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
-		list_del_init(&epc_page->list);
 		encl_page = epc_page->encl_owner;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
@@ -316,7 +313,7 @@ static void __sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -339,9 +336,9 @@ static void __sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -378,7 +375,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -433,6 +430,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -508,10 +507,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	sgx_epc_push_reclaimable(&sgx_global_lru, page);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -526,18 +525,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
 }
@@ -574,7 +573,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
+		if (list_empty(&sgx_global_lru.reclaimable))
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.38.1



* [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (3 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 22:13   ` Dave Hansen
  2022-12-02 18:36 ` [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages Kristen Carlson Accardi
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

Replace functions sgx_mark_page_reclaimable() and
sgx_unmark_page_reclaimable() with sgx_record_epc_page() and
sgx_drop_epc_page(). sgx_record_epc_page() will add the epc_page to
the "reclaimable" or "unreclaimable" list in the sgx_epc_lru_lists
struct, as indicated by its flags. sgx_drop_epc_page() will delete
the page from its LRU list. Keeping pages that the reclaimer does not
track on the sgx_epc_lru_lists "unreclaimable" list allows an OOM
event to free all the pages in use by an enclave, regardless of
whether they were reclaimable pages or not.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++++++---
 arch/x86/kernel/cpu/sgx/ioctl.c | 11 +++++++----
 arch/x86/kernel/cpu/sgx/main.c  | 26 +++++++++++++++-----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++--
 arch/x86/kernel/cpu/sgx/virt.c  | 28 ++++++++++++++++++++--------
 5 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 4eaf9d21e71b..4683da9ef4f1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -252,6 +252,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (IS_ERR(epc_page))
 			return ERR_CAST(epc_page);
+		sgx_record_epc_page(epc_page, 0);
 	}
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
@@ -259,7 +260,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_mark_page_reclaimable(entry->epc_page);
+	sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	return entry;
 }
@@ -375,7 +376,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(encl_page->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -687,7 +688,7 @@ void sgx_encl_release(struct kref *ref)
 			 * The page and its radix tree entry cannot be freed
 			 * if the page is being held by the reclaimer.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page))
+			if (sgx_drop_epc_page(entry->epc_page))
 				continue;
 
 			sgx_encl_free_epc_page(entry->epc_page);
@@ -703,6 +704,7 @@ void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
@@ -711,6 +713,7 @@ void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
@@ -1218,6 +1221,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
+	sgx_record_epc_page(epc_page, 0);
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 9a1bb3c3211a..aca80a3f38a1 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -48,6 +48,7 @@ void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
@@ -113,6 +114,8 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | SGX_ATTR_KSS;
 
+	sgx_record_epc_page(encl->secs.epc_page, 0);
+
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
 
@@ -322,7 +325,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(encl_page->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -958,7 +961,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			 * Prevent page from being reclaimed while mutex
 			 * is released.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+			if (sgx_drop_epc_page(entry->epc_page)) {
 				ret = -EAGAIN;
 				goto out_entry_changed;
 			}
@@ -973,7 +976,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_mark_page_reclaimable(entry->epc_page);
+			sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
 		}
 
 		/* Change EPC type */
@@ -1130,7 +1133,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 			goto out_unlock;
 		}
 
-		if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+		if (sgx_drop_epc_page(entry->epc_page)) {
 			ret = -EBUSY;
 			goto out_unlock;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 447cf4b8580c..ecd7f8e704cc 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -262,7 +262,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
-
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -499,31 +499,35 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_record_epc_page() - Add a page to the LRU tracking
  * @page:	EPC page
  *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
+ * Mark a page with the specified flags and add it to the appropriate
+ * (un)reclaimable list.
  */
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	sgx_epc_push_reclaimable(&sgx_global_lru, page);
+	WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	page->flags |= flags;
+	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+		sgx_epc_push_reclaimable(&sgx_global_lru, page);
+	else
+		sgx_epc_push_unreclaimable(&sgx_global_lru, page);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_drop_epc_page() - Remove a page from a LRU list
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
  *   -EBUSY if the page is in the process of being reclaimed
  */
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
@@ -533,9 +537,9 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 			return -EBUSY;
 		}
 
-		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
+	list_del(&page->list);
 	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 5e6d88438fae..ba4338b7303f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -159,8 +159,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
+int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
 void sgx_ipi_cb(void *info);
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 776ae5c1c032..0eabc4db91d0 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -64,6 +64,8 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
 		goto err_delete;
 	}
 
+	sgx_record_epc_page(epc_page, 0);
+
 	return 0;
 
 err_delete:
@@ -148,6 +150,7 @@ static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
 		return ret;
 	}
 
+	sgx_drop_epc_page(epc_page);
 	sgx_free_epc_page(epc_page);
 	return 0;
 }
@@ -220,8 +223,15 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 		 * have been removed, the SECS page must have a child on
 		 * another instance.
 		 */
-		if (sgx_vepc_free_page(epc_page))
+		if (sgx_vepc_free_page(epc_page)) {
+			/*
+			 * Drop the page before adding it to the list of SECS
+			 * pages.  Moving the page off the unreclaimable list
+			 * needs to be done under the LRU's spinlock.
+			 */
+			sgx_drop_epc_page(epc_page);
 			list_add_tail(&epc_page->list, &secs_pages);
+		}
 
 		xa_erase(&vepc->page_array, index);
 	}
@@ -236,15 +246,17 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 	mutex_lock(&zombie_secs_pages_lock);
 	list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
 		/*
-		 * Speculatively remove the page from the list of zombies,
-		 * if the page is successfully EREMOVE'd it will be added to
-		 * the list of free pages.  If EREMOVE fails, throw the page
-		 * on the local list, which will be spliced on at the end.
+		 * If EREMOVE fails, throw the page on the local list, which
+		 * will be spliced on at the end.
+		 *
+		 * Note, this abuses sgx_drop_epc_page() to delete the page off
+		 * the list of zombies, but this is a very rare path (probably
+		 * never hit in production).  It's not worth special casing the
+		 * free path for this super rare case just to avoid taking the
+		 * LRU's spinlock.
 		 */
-		list_del(&epc_page->list);
-
 		if (sgx_vepc_free_page(epc_page))
-			list_add_tail(&epc_page->list, &secs_pages);
+			list_move_tail(&epc_page->list, &secs_pages);
 	}
 
 	if (!list_empty(&secs_pages))
-- 
2.38.1



* [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (4 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 22:15   ` Dave Hansen
  2022-12-08 15:46   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim Kristen Carlson Accardi
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

When selecting pages to be reclaimed from the page pool (sgx_global_lru),
the list of reclaimable pages is walked, and any page that is both
reclaimable and not in the process of being freed is added to a list of
potential candidates to be reclaimed. After that, this separate list is
examined further, and each page on it may or may not ultimately be
reclaimed. To prevent such a page from being removed from the
sgx_epc_lru_lists struct by sgx_drop_epc_page() running in a separate
thread, keep track of whether the EPC page is in the middle of being
reclaimed with the addition of a RECLAIM_IN_PROGRESS flag, and do not
delete the page off the LRU in sgx_drop_epc_page() until reclaim of
that page has finished.
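
Concretely, the window being closed looks roughly like this (a
sketch; "thread B" stands for any sgx_drop_epc_page() caller, e.g.
enclave teardown):

  /*
   * reclaimer                        thread B
   * ---------                        --------
   * lock the LRU, take a page off
   * the reclaimable list for
   * examination, unlock the LRU
   *                                  sgx_drop_epc_page(page)
   *                                    deletes the page out from
   *                                    under the reclaimer
   *
   * With SGX_EPC_PAGE_RECLAIM_IN_PROGRESS set while the reclaimer
   * holds the page, sgx_drop_epc_page() instead returns -EBUSY and
   * the page stays under the reclaimer's control.
   */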

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 15 ++++++++++-----
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ecd7f8e704cc..bad72498b0a7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -305,13 +305,15 @@ static void __sgx_reclaim_pages(void)
 
 		encl_page = epc_page->encl_owner;
 
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
 			chunk[cnt++] = epc_page;
-		else
+		} else {
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -337,6 +339,7 @@ static void __sgx_reclaim_pages(void)
 
 skip:
 		spin_lock(&sgx_global_lru.lock);
+		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
 		sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
 		spin_unlock(&sgx_global_lru.lock);
 
@@ -360,7 +363,8 @@ static void __sgx_reclaim_pages(void)
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
+				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
 
 		sgx_free_epc_page(epc_page);
 	}
@@ -508,7 +512,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON(page->flags & (SGX_EPC_PAGE_RECLAIMER_TRACKED |
+			       SGX_EPC_PAGE_RECLAIM_IN_PROGRESS));
 	page->flags |= flags;
 	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
 		sgx_epc_push_reclaimable(&sgx_global_lru, page);
@@ -532,7 +537,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
+		if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
 			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ba4338b7303f..37d66bc6ca27 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -30,6 +30,8 @@
 #define SGX_EPC_PAGE_IS_FREE		BIT(1)
 /* Pages allocated for KVM guest */
 #define SGX_EPC_PAGE_KVM_GUEST		BIT(2)
+/* page flag to indicate reclaim is in progress */
+#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
 
 struct sgx_epc_page {
 	unsigned int section;
-- 
2.38.1



* [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (5 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-02 22:33   ` Dave Hansen
  2022-12-02 18:36 ` [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Kristen Carlson Accardi
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Change sgx_reclaim_pages() to use a list rather than an array for
storing the epc_pages which will be reclaimed. This change is needed
to transition to the LRU implementation for EPC cgroup support.

This change requires keeping track of whether newly recorded EPC
pages are pages for VA arrays or for enclave data. In addition,
helper functions are added to move pages from one list to another and
to enforce consistent queue-like behavior for the LRU lists.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  |  7 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c |  5 ++-
 arch/x86/kernel/cpu/sgx/main.c  | 69 +++++++++++++++++----------------
 arch/x86/kernel/cpu/sgx/sgx.h   | 42 ++++++++++++++++++++
 4 files changed, 85 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 4683da9ef4f1..9ee306ac2a8e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -252,7 +252,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (IS_ERR(epc_page))
 			return ERR_CAST(epc_page);
-		sgx_record_epc_page(epc_page, 0);
+		sgx_record_epc_page(epc_page, SGX_EPC_PAGE_ENCLAVE);
 	}
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
@@ -260,7 +260,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	sgx_record_epc_page(entry->epc_page,
+			    (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
 
 	return entry;
 }
@@ -1221,7 +1222,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
-	sgx_record_epc_page(epc_page, 0);
+	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_VERSION_ARRAY);
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index aca80a3f38a1..c3a9bffbc37e 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -114,7 +114,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | SGX_ATTR_KSS;
 
-	sgx_record_epc_page(encl->secs.epc_page, 0);
+	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_ENCLAVE);
 
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
@@ -325,7 +325,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_record_epc_page(encl_page->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	sgx_record_epc_page(encl_page->epc_page,
+			    (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index bad72498b0a7..83aaf5cea7b9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -288,37 +288,43 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  */
 static void __sgx_reclaim_pages(void)
 {
-	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	struct sgx_epc_page *epc_page;
 	pgoff_t page_index;
-	int cnt = 0;
+	LIST_HEAD(iso);
 	int ret;
 	int i;
 
 	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = sgx_epc_pop_reclaimable(&sgx_global_lru);
+		epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
 		if (!epc_page)
 			break;
 
 		encl_page = epc_page->encl_owner;
 
+		if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
+			continue;
+
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
-			chunk[cnt++] = epc_page;
+			list_move_tail(&epc_page->list, &iso);
 		} else {
-			/* The owner is freeing the page. No need to add the
-			 * page back to the list of reclaimable pages.
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			list_del_init(&epc_page->list);
 		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
+	if (list_empty(&iso))
+		return;
+
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_owner;
 
 		if (!sgx_reclaimer_age(epc_page))
@@ -333,6 +339,7 @@ static void __sgx_reclaim_pages(void)
 			goto skip;
 		}
 
+		i++;
 		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
 		mutex_unlock(&encl_page->encl->lock);
 		continue;
@@ -340,31 +347,25 @@ static void __sgx_reclaim_pages(void)
 skip:
 		spin_lock(&sgx_global_lru.lock);
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
-		sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
+		sgx_epc_move_reclaimable(&sgx_global_lru, epc_page);
 		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
-		chunk[i] = NULL;
 	}
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (epc_page)
-			sgx_reclaimer_block(epc_page);
-	}
-
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (!epc_page)
-			continue;
-
+	list_for_each_entry(epc_page, &iso, list)
+		sgx_reclaimer_block(epc_page);
+ 
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_owner;
-		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
-				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
+				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
+				     SGX_EPC_PAGE_ENCLAVE |
+				     SGX_EPC_PAGE_VERSION_ARRAY);
 
 		sgx_free_epc_page(epc_page);
 	}
@@ -505,6 +506,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 /**
  * sgx_record_epc_page() - Add a page to the LRU tracking
  * @page:	EPC page
+ * @flags:	Reclaim flags for the page.
  *
  * Mark a page with the specified flags and add it to the appropriate
  * (un)reclaimable list.
@@ -535,18 +537,19 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
-		/* The page is being reclaimed. */
-		if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	if ((page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) &&
+	    (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
 	}
 	list_del(&page->list);
 	spin_unlock(&sgx_global_lru.lock);
 
+	page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
+			 SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
+			 SGX_EPC_PAGE_ENCLAVE |
+			 SGX_EPC_PAGE_VERSION_ARRAY);
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 37d66bc6ca27..ec8d567cd975 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -32,6 +32,8 @@
 #define SGX_EPC_PAGE_KVM_GUEST		BIT(2)
 /* page flag to indicate reclaim is in progress */
 #define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
+#define SGX_EPC_PAGE_ENCLAVE		BIT(4)
+#define SGX_EPC_PAGE_VERSION_ARRAY	BIT(5)
 
 struct sgx_epc_page {
 	unsigned int section;
@@ -118,6 +120,14 @@ static inline void __sgx_epc_page_list_push(struct list_head *list, struct sgx_e
 	list_add_tail(&page->list, list);
 }
 
+/*
+ * Must be called with queue lock acquired
+ */
+static inline void __sgx_epc_page_list_move(struct list_head *list, struct sgx_epc_page *page)
+{
+	list_move_tail(&page->list, list);
+}
+
 /*
  * Must be called with queue lock acquired
  */
@@ -157,6 +167,38 @@ static inline void sgx_epc_push_unreclaimable(struct sgx_epc_lru_lists *lrus,
 	__sgx_epc_page_list_push(&(lrus)->unreclaimable, page);
 }
 
+/*
+ * Must be called with queue lock acquired
+ */
+static inline struct sgx_epc_page * __sgx_epc_page_list_peek(struct list_head *list)
+{
+	struct sgx_epc_page *epc_page;
+
+	if (list_empty(list))
+		return NULL;
+
+	epc_page = list_first_entry(list, struct sgx_epc_page, list);
+	return epc_page;
+}
+
+static inline struct sgx_epc_page *
+sgx_epc_peek_reclaimable(struct sgx_epc_lru_lists *lrus)
+{
+	return __sgx_epc_page_list_peek(&(lrus)->reclaimable);
+}
+
+static inline void sgx_epc_move_reclaimable(struct sgx_epc_lru_lists *lru,
+					    struct sgx_epc_page *page)
+{
+	__sgx_epc_page_list_move(&(lru)->reclaimable, page);
+}
+
+static inline struct sgx_epc_page *
+sgx_epc_peek_unreclaimable(struct sgx_epc_lru_lists *lrus)
+{
+	return __sgx_epc_page_list_peek(&(lrus)->unreclaimable);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.38.1



* [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (6 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:26   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed Kristen Carlson Accardi
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Modify sgx_reclaim_pages() to take a parameter that specifies the
number of pages to scan for reclaiming. Specify a max value of
32, but scan 16 in the usual case. This allows the number of pages
sgx_reclaim_pages() scans to be specified by the caller, and adjusted
in future patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 83aaf5cea7b9..f201ca85212f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -18,6 +18,8 @@
 #include "encl.h"
 #include "encls.h"
 
+#define SGX_MAX_NR_TO_RECLAIM	32
+
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -273,7 +275,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
+/**
+ * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
  * been accessed since the last scan. Move those pages to the tail of active
@@ -286,9 +291,9 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void __sgx_reclaim_pages(void)
+static void __sgx_reclaim_pages(int nr_to_scan)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
 	pgoff_t page_index;
@@ -297,7 +302,7 @@ static void __sgx_reclaim_pages(void)
 	int i;
 
 	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
+	for (i = 0; i < nr_to_scan; i++) {
 		epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
 		if (!epc_page)
 			break;
@@ -327,7 +332,7 @@ static void __sgx_reclaim_pages(void)
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_owner;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_MAX_NR_TO_RECLAIM || !sgx_reclaimer_age(epc_page))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -371,9 +376,9 @@ static void __sgx_reclaim_pages(void)
 	}
 }
 
-static void sgx_reclaim_pages(void)
+static void sgx_reclaim_pages(int nr_to_scan)
 {
-	__sgx_reclaim_pages();
+	__sgx_reclaim_pages(nr_to_scan);
 	cond_resched();
 }
 
@@ -393,7 +398,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		__sgx_reclaim_pages();
+		__sgx_reclaim_pages(SGX_NR_TO_SCAN);
 }
 
 static int ksgxd(void *p)
@@ -419,7 +424,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+			sgx_reclaim_pages(SGX_NR_TO_SCAN);
 	}
 
 	return 0;
@@ -598,7 +603,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages();
+		sgx_reclaim_pages(SGX_NR_TO_SCAN);
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-- 
2.38.1



* [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (7 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:30   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim Kristen Carlson Accardi
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Return the number of reclaimed pages from sgx_reclaim_pages(), the EPC
cgroup will use the result to track the success rate of its reclaim
calls, e.g. to escalate to a more forceful reclaiming mode if necessary.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f201ca85212f..a4a65eadfb79 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -291,7 +291,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void __sgx_reclaim_pages(int nr_to_scan)
+static int __sgx_reclaim_pages(int nr_to_scan)
 {
 	struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -326,7 +326,7 @@ static void __sgx_reclaim_pages(int nr_to_scan)
 	spin_unlock(&sgx_global_lru.lock);
 
 	if (list_empty(&iso))
-		return;
+		return 0;
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
@@ -374,12 +374,16 @@ static void __sgx_reclaim_pages(int nr_to_scan)
 
 		sgx_free_epc_page(epc_page);
 	}
+	return i;
 }
 
-static void sgx_reclaim_pages(int nr_to_scan)
+static int sgx_reclaim_pages(int nr_to_scan)
 {
-	__sgx_reclaim_pages(nr_to_scan);
+	int ret;
+
+	ret = __sgx_reclaim_pages(nr_to_scan);
 	cond_resched();
+	return ret;
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
-- 
2.38.1



* [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (8 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:37   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs Kristen Carlson Accardi
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag to sgx_reclaim_pages() to instruct it to ignore the age of a
page, i.e. reclaim the page even if it's young.  The EPC cgroup will use
the flag to enforce its limits by draining the reclaimable lists before
resorting to other measures, e.g. forcefully reclaiming "unreclaimable"
pages by killing enclaves.
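
A rough sketch of such a drain; only sgx_reclaim_pages() is from this
patch, example_over_limit() is an invented stand-in for the cgroup
limit check added later in the series:

	static bool example_over_limit(void);	/* stand-in limit check */

	static void example_drain(int nr_to_scan)
	{
		bool ignore_age = false;

		while (example_over_limit()) {
			if (sgx_reclaim_pages(nr_to_scan, ignore_age) > 0)
				continue;
			if (ignore_age)
				break;		/* nothing left to take */
			ignore_age = true;	/* gentle pass stalled */
		}
	}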

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 46 +++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index a4a65eadfb79..db96483e2e74 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -31,6 +31,10 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  * with sgx_global_lru.lock acquired.
  */
 static struct sgx_epc_lru_lists sgx_global_lru;
+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -278,6 +282,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 /**
  * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ * @ignore_age:		 Reclaim a page even if it is young
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -291,11 +296,12 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static int __sgx_reclaim_pages(int nr_to_scan)
+static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 {
 	struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
+	struct sgx_epc_lru_lists *lru;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
 	int ret;
@@ -332,7 +338,8 @@ static int __sgx_reclaim_pages(int nr_to_scan)
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_owner;
 
-		if (i == SGX_MAX_NR_TO_RECLAIM || !sgx_reclaimer_age(epc_page))
+		if (i == SGX_MAX_NR_TO_RECLAIM ||
+		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -350,10 +357,11 @@ static int __sgx_reclaim_pages(int nr_to_scan)
 		continue;
 
 skip:
-		spin_lock(&sgx_global_lru.lock);
+		lru = sgx_lru_lists(epc_page);
+		spin_lock(&lru->lock);
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
-		sgx_epc_move_reclaimable(&sgx_global_lru, epc_page);
-		spin_unlock(&sgx_global_lru.lock);
+		sgx_epc_move_reclaimable(lru, epc_page);
+		spin_unlock(&lru->lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
@@ -377,11 +385,11 @@ static int __sgx_reclaim_pages(int nr_to_scan)
 	return i;
 }
 
-static int sgx_reclaim_pages(int nr_to_scan)
+static int sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 {
 	int ret;
 
-	ret = __sgx_reclaim_pages(nr_to_scan);
+	ret = __sgx_reclaim_pages(nr_to_scan, ignore_age);
 	cond_resched();
 	return ret;
 }
@@ -402,7 +410,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		__sgx_reclaim_pages(SGX_NR_TO_SCAN);
+		__sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 }
 
 static int ksgxd(void *p)
@@ -428,7 +436,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages(SGX_NR_TO_SCAN);
+			sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 	}
 
 	return 0;
@@ -522,15 +530,17 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	WARN_ON(page->flags & (SGX_EPC_PAGE_RECLAIMER_TRACKED |
 			       SGX_EPC_PAGE_RECLAIM_IN_PROGRESS));
 	page->flags |= flags;
 	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
-		sgx_epc_push_reclaimable(&sgx_global_lru, page);
+		sgx_epc_push_reclaimable(lru, page);
 	else
-		sgx_epc_push_unreclaimable(&sgx_global_lru, page);
-	spin_unlock(&sgx_global_lru.lock);
+		sgx_epc_push_unreclaimable(lru, page);
+	spin_unlock(&lru->lock);
 }
 
 /**
@@ -545,14 +555,16 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
  */
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	if ((page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) &&
 	    (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)) {
-		spin_unlock(&sgx_global_lru.lock);
+		spin_unlock(&lru->lock);
 		return -EBUSY;
 	}
 	list_del(&page->list);
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
 			 SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
@@ -607,7 +619,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages(SGX_NR_TO_SCAN);
+		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (9 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:42   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Kristen Carlson Accardi
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add an sgx_can_reclaim() wrapper so that a subsequent patch can cleanly
support multiple LRUs.
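
To make the intent concrete, this is the shape the wrapper takes by
patch 17 of this series, once per-cgroup LRUs exist:

	static bool sgx_can_reclaim(void)
	{
		if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
			return !list_empty(&sgx_global_lru.reclaimable);

		/* any cgroup LRU with reclaimable pages will do */
		return !sgx_epc_cgroup_lru_empty(NULL);
	}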

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index db96483e2e74..96399e2016a8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -394,10 +394,15 @@ static int sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 	return ret;
 }
 
+static bool sgx_can_reclaim(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+		sgx_can_reclaim();
 }
 
 /*
@@ -606,7 +611,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_global_lru.reclaimable))
+		if (!sgx_can_reclaim())
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (10 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:46   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Kristen Carlson Accardi
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Expose the top-level reclaim function as sgx_reclaim_epc_pages() for use
by the upcoming EPC cgroup, which will initiate reclaim to enforce
changes to high/max limits.
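
For reference, the EPC cgroup (patch 17) ends up calling the exported
function roughly as below; the clamping constants come from that later
patch, where the function also grows a cgroup argument:

	static int example_reclaim_clamped(unsigned long nr_pages, bool ignore_age)
	{
		nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
		nr_pages = min(nr_pages, SGX_EPC_RECLAIM_MAX_PAGES);

		return sgx_reclaim_epc_pages(nr_pages, ignore_age);
	}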

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 7 ++++---
 arch/x86/kernel/cpu/sgx/sgx.h  | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 96399e2016a8..c947b4ae06f3 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -281,6 +281,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
 /**
  * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
  *
@@ -385,7 +386,7 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 	return i;
 }
 
-static int sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
 {
 	int ret;
 
@@ -441,7 +442,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 	}
 
 	return 0;
@@ -624,7 +625,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ec8d567cd975..ce859331ddf5 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -206,6 +206,7 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (11 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08  9:56   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Kristen Carlson Accardi
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the isolation loop into a standalone helper, sgx_isolate_pages(),
in preparation for the existence of multiple LRUs.  Expose the helper to
other SGX code so that it can be called from the EPC cgroup code, e.g.
to isolate pages from a single cgroup LRU.  Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.
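
Condensed from the cgroup code later in the series (patch 17), the
isolation helper gets consumed like this:

	static void example_isolate(struct sgx_epc_cgroup *root,
				    int *nr_to_scan, struct list_head *dst)
	{
		struct sgx_epc_cgroup *epc_cg;
		unsigned long epoch;

		for (epc_cg = sgx_epc_cgroup_iter(NULL, root, &epoch);
		     epc_cg;
		     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, &epoch)) {
			sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
			if (!*nr_to_scan) {
				sgx_epc_cgroup_iter_break(epc_cg, root);
				break;
			}
		}
	}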

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 68 +++++++++++++++++++++-------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 +
 2 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c947b4ae06f3..a59550fa150b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -280,7 +280,46 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 }
 
 /**
- * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int *nr_to_scan,
+			   struct list_head *dst)
+{
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+
+	spin_lock(&lru->lock);
+	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
+		if (list_empty(&lru->reclaimable))
+			break;
+
+		epc_page = sgx_epc_peek_reclaimable(lru);
+		if (!epc_page)
+			break;
+
+		encl_page = epc_page->encl_owner;
+
+		if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
+			continue;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
+			list_move_tail(&epc_page->list, dst);
+		} else {
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
+			 */
+			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			list_del_init(&epc_page->list);
+		}
+	}
+	spin_unlock(&lru->lock);
+}
+
+/**
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
@@ -305,37 +344,14 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 	struct sgx_epc_lru_lists *lru;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
+	int i = 0;
 	int ret;
-	int i;
-
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < nr_to_scan; i++) {
-		epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
-		if (!epc_page)
-			break;
-
-		encl_page = epc_page->encl_owner;
 
-		if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
-			continue;
-
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
-			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
-			list_move_tail(&epc_page->list, &iso);
-		} else {
-			/* The owner is freeing the page, remove it from the
-			 * LRU list
-			 */
-			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
-			list_del_init(&epc_page->list);
-		}
-	}
-	spin_unlock(&sgx_global_lru.lock);
+	sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
 
-	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_owner;
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ce859331ddf5..4499a5d5547d 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -207,6 +207,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, int *nr_to_scan,
+			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (12 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08 15:21   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events Kristen Carlson Accardi
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce the OOM path for killing an enclave when the reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with EPC pages remaining that are not accessible to
the reclaimer ("unreclaimable"). Once a victim is identified, mark the
enclave as OOM and zap the enclave's entire page range. Release all the
enclave's resources except for the struct sgx_encl memory itself.
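
The heart of the new sgx_epc_oom() entry point, condensed from the
diff below:

	spin_lock(&lru->lock);
	victim = sgx_oom_get_victim(lru);	/* first unreclaimable page
						 * whose enclave is alive */
	spin_unlock(&lru->lock);

	if (victim) {
		if (victim->flags & SGX_EPC_PAGE_ENCLAVE)
			sgx_oom_encl_page(victim->encl_owner);
		else if (victim->flags & SGX_EPC_PAGE_VERSION_ARRAY)
			sgx_oom_encl(victim->encl);	/* zap and destroy */
	}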

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/encl.c |  74 +++++++++++++++---
 arch/x86/kernel/cpu/sgx/encl.h |   2 +
 arch/x86/kernel/cpu/sgx/main.c | 135 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h  |   1 +
 4 files changed, 201 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 9ee306ac2a8e..ba350b2961d1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -623,7 +623,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_OOM, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -669,16 +670,8 @@ const struct vm_operations_struct sgx_vm_ops = {
 	.access = sgx_vma_access,
 };
 
-/**
- * sgx_encl_release - Destroy an enclave instance
- * @ref:	address of a kref inside &sgx_encl
- *
- * Used together with kref_put(). Frees all the resources associated with the
- * enclave and the instance itself.
- */
-void sgx_encl_release(struct kref *ref)
+static void __sgx_encl_release(struct sgx_encl *encl)
 {
-	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
 	struct sgx_va_page *va_page;
 	struct sgx_encl_page *entry;
 	unsigned long index;
@@ -713,7 +706,7 @@ void sgx_encl_release(struct kref *ref)
 	while (!list_empty(&encl->va_pages)) {
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
-		list_del(&va_page->list);
+		list_del_init(&va_page->list);
 		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
@@ -729,10 +722,66 @@ void sgx_encl_release(struct kref *ref)
 	/* Detect EPC page leak's. */
 	WARN_ON_ONCE(encl->secs_child_cnt);
 	WARN_ON_ONCE(encl->secs.epc_page);
+}
+
+/**
+ * sgx_encl_release - Destroy an enclave instance
+ * @ref:	address of a kref inside &sgx_encl
+ *
+ * Used together with kref_put(). Frees all the resources associated with the
+ * enclave and the instance itself.
+ */
+void sgx_encl_release(struct kref *ref)
+{
+	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
+
+	/* if the enclave was OOM killed previously, it just needs to be freed */
+	if (!test_bit(SGX_ENCL_OOM, &encl->flags))
+		__sgx_encl_release(encl);
 
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_destroy - prepare the enclave for release
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during oom kill to empty the mm_list entries after they have
+ * been zapped. Release the remaining enclave resources without freeing
+ * struct sgx_encl.
+ */
+void sgx_encl_destroy(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+
+	__sgx_encl_release(encl);
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -802,6 +851,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..f4935632e53a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_OOM		= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_destroy(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index a59550fa150b..70046c4e332a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -677,6 +677,141 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_PAGE_ENCLAVE)
+		encl = ((struct sgx_encl_page *)epc_page->encl_owner)->encl;
+	else if (epc_page->flags & SGX_EPC_PAGE_VERSION_ARRAY)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	struct vm_area_struct *vma, *tmp;
+	unsigned long vm_end;
+
+	vma = find_vma(mm, start);
+	if (!vma || vma->vm_ops != ops || vma->vm_private_data != owner ||
+	    vma->vm_start >= end)
+		return;
+
+	for (tmp = vma; tmp->vm_start < end; tmp = tmp->vm_next) {
+		do {
+			vm_end = tmp->vm_end;
+			tmp = tmp->vm_next;
+		} while (tmp && tmp->vm_ops == ops &&
+			 vma->vm_private_data == owner && tmp->vm_start < end);
+
+		zap_page_range(vma, vma->vm_start, vm_end - vma->vm_start);
+
+		if (!tmp)
+			break;
+	}
+}
+
+static void sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	int idx;
+
+	set_bit(SGX_ENCL_OOM, &encl->flags);
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out;
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	mutex_lock(&encl->lock);
+	sgx_encl_destroy(encl);
+	mutex_unlock(&encl->lock);
+
+out:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+}
+
+static inline void sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_PAGE_ENCLAVE)
+		sgx_oom_encl_page(victim->encl_owner);
+	else if (victim->flags & SGX_EPC_PAGE_VERSION_ARRAY)
+		sgx_oom_encl(victim->encl);
+	else
+		WARN_ON_ONCE(1);
+
+	return true;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 4499a5d5547d..1c666b25294b 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -209,6 +209,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
 void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, int *nr_to_scan,
 			   struct list_head *dst);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (13 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08 14:53   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage Kristen Carlson Accardi
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: zhiquan1.li, Kristen Carlson Accardi

Consumers of the misc cgroup controller might need to perform separate
actions in the event of a cgroup alloc, free or release call. In
addition, writes to the max value may also need a separate action. Add
the ability for downstream users to set up callbacks for these
operations, and call the per resource type callback when appropriate.

This code will be utilized by the SGX driver in a future patch.
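
For example, the SGX EPC controller added later in this series (patch
17) wires up its callbacks like so:

	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_released = sgx_epc_cgroup_released;
	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;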

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 include/linux/misc_cgroup.h |  6 +++++
 kernel/cgroup/misc.c        | 51 ++++++++++++++++++++++++++++++++++---
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index c238207d1615..83620e7c4bb1 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -37,6 +37,12 @@ struct misc_res {
 	unsigned long max;
 	atomic_long_t usage;
 	atomic_long_t events;
+
+	/* per resource callback ops */
+	int (*misc_cg_alloc)(struct misc_cg *cg);
+	void (*misc_cg_free)(struct misc_cg *cg);
+	void (*misc_cg_released)(struct misc_cg *cg);
+	void (*misc_cg_max_write)(struct misc_cg *cg);
 };
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index fe3e8a0eb7ed..3d17afd5b7a8 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -278,10 +278,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
 
 	cg = css_misc(of_css(of));
 
-	if (READ_ONCE(misc_res_capacity[type]))
+	if (READ_ONCE(misc_res_capacity[type])) {
 		WRITE_ONCE(cg->res[type].max, max);
-	else
+		if (cg->res[type].misc_cg_max_write)
+			cg->res[type].misc_cg_max_write(cg);
+	} else {
 		ret = -EINVAL;
+	}
 
 	return ret ? ret : nbytes;
 }
@@ -385,23 +388,39 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+	struct misc_cg *parent_cg;
 	enum misc_res_type i;
 	struct misc_cg *cg;
+	int ret;
 
 	if (!parent_css) {
 		cg = &root_cg;
+		parent_cg = &root_cg;
 	} else {
 		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
 		if (!cg)
 			return ERR_PTR(-ENOMEM);
+		parent_cg = css_misc(parent_css);
 	}
 
 	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
 		WRITE_ONCE(cg->res[i].max, MAX_NUM);
 		atomic_long_set(&cg->res[i].usage, 0);
+		if (parent_cg->res[i].misc_cg_alloc) {
+			ret = parent_cg->res[i].misc_cg_alloc(cg);
+			if (ret)
+				goto alloc_err;
+		}
 	}
 
 	return &cg->css;
+
+alloc_err:
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (parent_cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+	kfree(cg);
+	return ERR_PTR(ret);
 }
 
 /**
@@ -412,13 +431,39 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_misc(css));
+	struct misc_cg *cg = css_misc(css);
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+
+	kfree(cg);
+}
+
+/**
+ * misc_cg_released() - Release the misc cgroup
+ * @css: cgroup subsys object.
+ *
+ * Call the misc_cg resource type released callbacks.
+ *
+ * Context: Any context.
+ */
+static void misc_cg_released(struct cgroup_subsys_state *css)
+{
+	struct misc_cg *cg = css_misc(css);
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_released)
+			cg->res[i].misc_cg_released(cg);
 }
 
 /* Cgroup controller callbacks */
 struct cgroup_subsys misc_cgrp_subsys = {
 	.css_alloc = misc_cg_alloc,
 	.css_free = misc_cg_free,
+	.css_released = misc_cg_released,
 	.legacy_cftypes = misc_cg_files,
 	.dfl_cftypes = misc_cg_files,
 };
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (14 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08 15:23   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller Kristen Carlson Accardi
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: zhiquan1.li, Kristen Carlson Accardi

The SGX driver will need access to the root misc_cg object
to do iterative walks and to determine whether a charge will be
made against the root cgroup or not.

To manage the SGX EPC memory via the misc controller, the SGX
driver will also need to be able to iterate over the misc cgroup
hierarchy.

Move parent_misc() into misc_cgroup.h, make it inline so that it is
available to SGX, rename it to misc_cg_parent(), and update misc.c to
use the new name.

Add per resource type private data so that SGX can store additional
per cgroup data with the misc_cg struct.

Allow SGX EPC memory to be a valid resource type for the misc
controller.
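
With the helper exported, a consumer can walk up the hierarchy the
same way misc_cg_try_charge() already does internally:

	static void example_walk_up(struct misc_cg *cg, enum misc_res_type type)
	{
		struct misc_cg *i;

		for (i = cg; i; i = misc_cg_parent(i)) {
			/* inspect or charge i->res[type] at each level */
		}
	}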

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
 kernel/cgroup/misc.c        | 25 ++++++++++++-------------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 83620e7c4bb1..53a64d3bb6d7 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
 	MISC_CG_RES_SEV,
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* SGX EPC memory resource */
+	MISC_CG_RES_SGX_EPC,
 #endif
 	MISC_CG_RES_TYPES
 };
@@ -37,6 +41,7 @@ struct misc_res {
 	unsigned long max;
 	atomic_long_t usage;
 	atomic_long_t events;
+	void *priv;
 
 	/* per resource callback ops */
 	int (*misc_cg_alloc)(struct misc_cg *cg);
@@ -59,6 +64,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 unsigned long misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
@@ -80,6 +86,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -104,6 +124,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline unsigned long misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 3d17afd5b7a8..e1e506847dea 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* Intel SGX EPC memory bytes */
+	"sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
@@ -40,18 +44,13 @@ static struct misc_cg root_cg;
 static unsigned long misc_res_capacity[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -151,7 +150,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic_long_add_return(amount, &res->usage);
@@ -164,12 +163,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic_long_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -192,7 +191,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg,
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (15 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2022-12-08 15:30   ` Jarkko Sakkinen
  2022-12-02 18:36 ` [PATCH v2 18/18] Docs/x86/sgx: Add description for cgroup support Kristen Carlson Accardi
  2023-04-03 21:26 ` [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Anand Krishnamoorthi
  18 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 539 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 +++
 arch/x86/kernel/cpu/sgx/main.c       |  86 ++++-
 arch/x86/kernel/cpu/sgx/sgx.h        |   6 +-
 6 files changed, 688 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..0eeae4ebe1c3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1936,6 +1936,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+	depends on X86_SGX && CGROUP_MISC
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup via
+	  the Miscellaneous cgroup controller.
+
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)	+= epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..d668a67fde84
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,539 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
+#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
+#define SGX_EPC_RECLAIM_MAX_PAGES		64UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+struct sgx_epc_reclaim_control {
+	struct sgx_epc_cgroup *epc_cg;
+	int nr_fails;
+	bool ignore_age;
+};
+
+static inline unsigned long sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+	 return atomic_long_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline unsigned long sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+	 return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	if (cg)
+		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+
+	return NULL;
+}
+
+static inline struct sgx_epc_cgroup *parent_epc_cgroup(struct sgx_epc_cgroup *epc_cg)
+{
+	return sgx_epc_cgroup_from_misc_cg(misc_cg_parent(epc_cg->cg));
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_iter - iterate over the EPC cgroup hierarchy
+ * @root:		hierarchy root
+ * @prev:		previously returned epc_cg, NULL on first invocation
+ * @reclaim_epoch:	epoch for shared reclaim walks, NULL for full walks
+ *
+ * Return: references to children of the hierarchy below @root, or
+ * @root itself, or %NULL after a full round-trip.
+ *
+ * Caller must pass the return value in @prev on subsequent invocations
+ * for reference counting, or use sgx_epc_cgroup_iter_break() to cancel
+ * a hierarchy walk before the round-trip is complete.
+ */
+static struct sgx_epc_cgroup *sgx_epc_cgroup_iter(struct sgx_epc_cgroup *prev,
+						  struct sgx_epc_cgroup *root,
+						  unsigned long *reclaim_epoch)
+{
+	struct cgroup_subsys_state *css = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	struct sgx_epc_cgroup *pos = NULL;
+	bool inc_epoch = false;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	if (!root)
+		root = sgx_epc_cgroup_from_misc_cg(misc_cg_root());
+
+	if (prev && !reclaim_epoch)
+		pos = prev;
+
+	rcu_read_lock();
+
+start:
+	if (reclaim_epoch) {
+		/*
+		 * Abort the walk if a reclaimer working from the same root has
+		 * started a new walk after this reclaimer has already scanned
+		 * at least one cgroup.
+		 */
+		if (prev && *reclaim_epoch != root->epoch)
+			goto out;
+
+		while (1) {
+			pos = READ_ONCE(root->reclaim_iter);
+			if (!pos || css_tryget(&pos->cg->css))
+				break;
+
+			/*
+			 * The css is dying, clear the reclaim_iter immediately
+			 * instead of waiting for ->css_released to be called.
+			 * Busy waiting serves no purpose and attempting to wait
+			 * for ->css_released may actually block it from being
+			 * called.
+			 */
+			(void)cmpxchg(&root->reclaim_iter, pos, NULL);
+		}
+	}
+
+	if (pos)
+		css = &pos->cg->css;
+
+	while (!epc_cg) {
+		struct misc_cg *cg;
+
+		css = css_next_descendant_pre(css, &root->cg->css);
+		if (!css) {
+			/*
+			 * Increment the epoch as we've reached the end of the
+			 * tree and the next call to css_next_descendant_pre
+			 * will restart at root.  Do not update root->epoch
+			 * directly as we should only do so if we update the
+			 * reclaim_iter, i.e. a different thread may win the
+			 * race and update the epoch for us.
+			 */
+			inc_epoch = true;
+
+			/*
+			 * Reclaimers share the hierarchy walk, and a new one
+			 * might jump in at the end of the hierarchy.  Restart
+		 * at root so that we don't return NULL on a thread's
+			 * initial call.
+			 */
+			if (!prev)
+				continue;
+			break;
+		}
+
+		cg = css_misc(css);
+		/*
+		 * Verify the css and acquire a reference.  Don't take an
+		 * extra reference to root as it's either the global root
+		 * or is provided by the caller and so is guaranteed to be
+		 * alive.  Keep walking if this css is dying.
+		 */
+		if (cg != root->cg && !css_tryget(&cg->css))
+			continue;
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	}
+
+	if (reclaim_epoch) {
+		/*
+		 * reclaim_iter could have already been updated by a competing
+		 * thread; check that the value hasn't changed since we read
+		 * it to avoid reclaiming from the same cgroup twice.  If the
+		 * value did change, put all of our references and restart the
+		 * entire process, for all intents and purposes we're making a
+		 * new call.
+		 */
+		if (cmpxchg(&root->reclaim_iter, pos, epc_cg) != pos) {
+			if (epc_cg && epc_cg != root)
+				put_misc_cg(epc_cg->cg);
+			if (pos)
+				put_misc_cg(pos->cg);
+			css = NULL;
+			epc_cg = NULL;
+			inc_epoch = false;
+			goto start;
+		}
+
+		if (inc_epoch)
+			root->epoch++;
+		if (!prev)
+			*reclaim_epoch = root->epoch;
+
+		if (pos)
+			put_misc_cg(pos->cg);
+	}
+
+out:
+	rcu_read_unlock();
+	if (prev && prev != root)
+		put_misc_cg(prev->cg);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_iter_break - abort a hierarchy walk prematurely
+ * @prev:	last visited cgroup as returned by sgx_epc_cgroup_iter()
+ * @root:	hierarchy root
+ */
+static void sgx_epc_cgroup_iter_break(struct sgx_epc_cgroup *prev,
+				      struct sgx_epc_cgroup *root)
+{
+	if (!root)
+		root = sgx_epc_cgroup_from_misc_cg(misc_cg_root());
+	if (prev && prev != root)
+		put_misc_cg(prev->cg);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
+ * @root:	root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+	     epc_cg;
+	     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+		if (!list_empty(&epc_cg->lru.reclaimable)) {
+			sgx_epc_cgroup_iter_break(epc_cg, root);
+			return false;
+		}
+	}
+	return true;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and isolate pages
+ * @root:	root of the tree to start walking
+ * @nr_to_scan: The number of pages that need to be isolated
+ * @dst:	Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  int *nr_to_scan, struct list_head *dst)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	unsigned long epoch;
+
+	if (!*nr_to_scan)
+		return;
+
+	for (epc_cg = sgx_epc_cgroup_iter(NULL, root, &epoch);
+	     epc_cg;
+	     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, &epoch)) {
+		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+		if (!*nr_to_scan) {
+			sgx_epc_cgroup_iter_break(epc_cg, root);
+			break;
+		}
+	}
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+					struct sgx_epc_reclaim_control *rc)
+{
+	/*
+	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and
+	 * maximum number of pages.  Attempting to reclaim only a few
+	 * pages will often fail and is inefficient, while reclaiming a
+	 * huge number of pages can result in soft lockups due to holding
+	 * various locks for an extended duration.  This also bounds
+	 * nr_pages to a sane batch size for each reclaim call.
+	 */
+	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+	nr_pages = min(nr_pages, SGX_EPC_RECLAIM_MAX_PAGES);
+
+	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+		return -ENOMEM;
+
+	++rc->nr_fails;
+	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+		rc->ignore_age = true;
+
+	return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+				  struct sgx_epc_cgroup *epc_cg)
+{
+	rc->epc_cg = epc_cg;
+	rc->nr_fails = 0;
+	rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	unsigned long cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge
+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.  Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_EPC_RECLAIM_MAX_PAGES)
+			max -= (SGX_EPC_RECLAIM_MIN_PAGES/2);
+
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc))
+				break;
+		}
+	}
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+				       unsigned long nr_pages, bool reclaim)
+{
+	struct sgx_epc_reclaim_control rc;
+	unsigned long cur, max, over;
+	unsigned int nr_empty = 0;
+
+	if (epc_cg == sgx_epc_cgroup_from_misc_cg(misc_cg_root())) {
+		misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+				   nr_pages * PAGE_SIZE);
+		return 0;
+	}
+
+	sgx_epc_reclaim_control_init(&rc, NULL);
+
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					nr_pages * PAGE_SIZE))
+			break;
+
+		rc.epc_cg = epc_cg;
+		max = sgx_epc_cgroup_max_pages(rc.epc_cg);
+		if (nr_pages > max)
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		cur = sgx_epc_cgroup_page_counter_read(rc.epc_cg);
+		over = ((cur + nr_pages) > max) ?
+			(cur + nr_pages) - max : SGX_EPC_RECLAIM_MIN_PAGES;
+
+		if (!sgx_epc_cgroup_reclaim_pages(over, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					return -ENOMEM;
+				schedule();
+			}
+		}
+	}
+
+	css_get_many(&epc_cg->cg->css, nr_pages);
+
+	return 0;
+}
+
+/**
+ * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page
+ * @mm:			the mm_struct of the process to charge
+ * @reclaim:		whether or not synchronous reclaim is allowed
+ *
+ * Return: the EPC cgroup, NULL if disabled, or ERR_PTR(-errno) on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+						 bool reclaim)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	int ret;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, 1, reclaim);
+	put_misc_cg(epc_cg->cg);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (epc_cg->cg != misc_cg_root())
+		put_misc_cg(epc_cg->cg);
+}
+
+static void sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+	     epc_cg;
+	     epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+		if (sgx_epc_oom(&epc_cg->lru)) {
+			sgx_epc_cgroup_iter_break(epc_cg, root);
+			return;
+		}
+	}
+}
+
+static void sgx_epc_cgroup_released(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *dead_cg;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	dead_cg = epc_cg;
+
+	while ((epc_cg = parent_epc_cgroup(epc_cg)))
+		cmpxchg(&epc_cg->reclaim_iter, dead_cg, NULL);
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	unsigned int nr_empty = 0;
+	unsigned long cur, max;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	max = sgx_epc_cgroup_max_pages(epc_cg);
+
+	for (;;) {
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					sgx_epc_cgroup_oom(epc_cg);
+				schedule();
+			}
+		}
+	}
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(struct sgx_epc_cgroup), GFP_KERNEL);
+	if (!epc_cg)
+		return -ENOMEM;
+
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_released = sgx_epc_cgroup_released;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;
+	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+	epc_cg->cg = cg;
+	return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);
+
+	return sgx_epc_cgroup_alloc(misc_cg_root());
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..bc358934dbe2
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+							       bool reclaim)
+{
+	return NULL;
+}
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+						int *nr_to_scan,
+						struct list_head *dst) { }
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	return NULL;
+}
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	return true;
+}
+#else
+struct sgx_epc_cgroup {
+	struct misc_cg		*cg;
+	struct sgx_epc_lru_lists	lru;
+	struct sgx_epc_cgroup	*reclaim_iter;
+	struct work_struct	reclaim_work;
+	unsigned int		epoch;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+						 bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  int *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	if (epc_cg)
+		return &epc_cg->lru;
+	return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 70046c4e332a..a9d5cfd4e024 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
 #define SGX_MAX_NR_TO_RECLAIM	32
 
@@ -33,9 +35,20 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 static struct sgx_epc_lru_lists sgx_global_lru;
 static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return epc_cg_lru(epc_page->epc_cg);
+
 	return &sgx_global_lru;
 }
 
+static inline bool sgx_can_reclaim(void)
+{
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return !list_empty(&sgx_global_lru.reclaimable);
+
+	return !sgx_epc_cgroup_lru_empty(NULL);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -320,9 +333,10 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int *nr_to_scan,
 }
 
 /**
- * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
+ * __sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
+ * @epc_cg:		 EPC cgroup from which to reclaim
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -336,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int *nr_to_scan,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
+static int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+			  struct sgx_epc_cgroup *epc_cg)
 {
 	struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -347,7 +362,15 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 	int i = 0;
 	int ret;
 
-	sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+	/*
+	 * If a specific cgroup is not being targeted, take from the global
+	 * list first, even when cgroups are enabled.  If there are
+	 * pages on the global LRU then they should get reclaimed asap.
+	 */
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
@@ -397,25 +420,33 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
 				     SGX_EPC_PAGE_ENCLAVE |
 				     SGX_EPC_PAGE_VERSION_ARRAY);
 
+		if (epc_page->epc_cg) {
+			sgx_epc_cgroup_uncharge(epc_page->epc_cg);
+			epc_page->epc_cg = NULL;
+		}
+
 		sgx_free_epc_page(epc_page);
 	}
 	return i;
 }
 
-int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
+/**
+ * sgx_reclaim_epc_pages() - wrapper for __sgx_reclaim_epc_pages() which
+ *			     calls cond_resched() upon completion.
+ * @nr_to_scan:		Number of EPC pages to scan for reclaim
+ * @ignore_age:		Reclaim a page even if it is young
+ * @epc_cg:		EPC cgroup from which to reclaim
+ */
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+			  struct sgx_epc_cgroup *epc_cg)
 {
 	int ret;
 
-	ret = __sgx_reclaim_pages(nr_to_scan, ignore_age);
+	ret = __sgx_reclaim_epc_pages(nr_to_scan, ignore_age, epc_cg);
 	cond_resched();
 	return ret;
 }
 
-static bool sgx_can_reclaim(void)
-{
-	return !list_empty(&sgx_global_lru.reclaimable);
-}
-
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
@@ -432,7 +463,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		__sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+		__sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 }
 
 static int ksgxd(void *p)
@@ -458,7 +489,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 	}
 
 	return 0;
@@ -620,6 +651,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 {
 	struct sgx_epc_page *page;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_try_charge(current->mm, reclaim);
+	if (IS_ERR(epc_cg))
+		return ERR_CAST(epc_cg);
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -628,8 +664,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (!sgx_can_reclaim())
-			return ERR_PTR(-ENOMEM);
+		if (!sgx_can_reclaim()) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (!reclaim) {
 			page = ERR_PTR(-EBUSY);
@@ -641,7 +679,14 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
+	}
+
+	if (!IS_ERR(page)) {
+		WARN_ON(page->epc_cg);
+		page->epc_cg = epc_cg;
+	} else {
+		sgx_epc_cgroup_uncharge(epc_cg);
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
@@ -674,6 +719,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	page->flags = SGX_EPC_PAGE_IS_FREE;
 
 	spin_unlock(&node->lock);
+
+	if (page->epc_cg) {
+		sgx_epc_cgroup_uncharge(page->epc_cg);
+		page->epc_cg = NULL;
+	}
+
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
@@ -838,6 +889,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].encl_owner = NULL;
 		section->pages[i].poison = 0;
+		section->pages[i].epc_cg = NULL;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -1002,6 +1054,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -1052,6 +1105,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -1061,6 +1115,8 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 1c666b25294b..defb48f51145 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -35,6 +35,8 @@
 #define SGX_EPC_PAGE_ENCLAVE		BIT(4)
 #define SGX_EPC_PAGE_VERSION_ARRAY	BIT(5)
 
+struct sgx_epc_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -46,6 +48,7 @@ struct sgx_epc_page {
 		struct sgx_encl *encl;
 	};
 	struct list_head list;
+	struct sgx_epc_cgroup *epc_cg;
 };
 
 /*
@@ -206,7 +209,8 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
-int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+			  struct sgx_epc_cgroup *epc_cg);
 void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, int *nr_to_scan,
 			   struct list_head *dst);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
-- 
2.38.1

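For readers who don't have the companion misc-controller patches handy,
here is a rough sketch of how the per-resource ops registered by
sgx_epc_cgroup_alloc() above are presumably dispatched when a child
cgroup is created.  Only the res[] fields and MISC_CG_RES_SGX_EPC come
from this patch; misc_cg_alloc_hook() and its caller are hypothetical
stand-ins for the actual misc controller code.

	/*
	 * Hedged sketch, not part of the posted patches: invoke the
	 * parent's per-resource alloc op for each resource type.  The
	 * callback (e.g. sgx_epc_cgroup_alloc()) installs its own ops
	 * and priv pointer on the new child, so nothing is copied here.
	 */
	static int misc_cg_alloc_hook(struct misc_cg *parent, struct misc_cg *cg)
	{
		int i, ret;

		for (i = 0; i < MISC_CG_RES_TYPES; i++) {
			if (!parent->res[i].misc_cg_alloc)
				continue;

			ret = parent->res[i].misc_cg_alloc(cg);
			if (ret)
				return ret;
		}

		return 0;
	}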

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 18/18] Docs/x86/sgx: Add description for cgroup support
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (16 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller Kristen Carlson Accardi
@ 2022-12-02 18:36 ` Kristen Carlson Accardi
  2023-04-03 21:26 ` [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Anand Krishnamoorthi
  18 siblings, 0 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 18:36 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet
  Cc: zhiquan1.li, Kristen Carlson Accardi, Sean Christopherson,
	Bagas Sanjaya, linux-doc

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 Documentation/x86/sgx.rst | 77 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index 2bcbffacbed5..f6ca5594dcf2 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -300,3 +300,80 @@ to expected failures and handle them as follows:
    first call.  It indicates a bug in the kernel or the userspace client
    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
    a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that
+is used to provide SGX-enabled applications with protected memory,
+and is otherwise inaccessible, i.e. shows up as reserved in
+/proc/iomem and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM,
+for all intents and purposes the EPC is independent from normal system
+memory, e.g. must be reserved at boot from RAM and cannot be converted
+between EPC and normal memory while the system is running.  The EPC is
+managed by the SGX subsystem and is not accounted by the memory
+controller.  Note that this is true only for EPC memory itself, i.e.
+normal memory allocations related to SGX and EPC memory, e.g. the
+backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via
+virtual memory techniques and pages can be swapped out of the EPC
+to their backing store (normal system memory allocated via shmem).
+The SGX EPC subsystem is analogous to the memory subsystem, and
+it implements limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface
+files, please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated
+otherwise.  If a value which is not PAGE_SIZE aligned is written,
+the actual value used by the controller will be rounded down to
+the closest PAGE_SIZE multiple.
+
+  misc.capacity
+        A read-only flat-keyed file shown only in the root cgroup.
+        The sgx_epc resource will show the total amount of EPC
+        memory available on the platform.
+
+  misc.current
+        A read-only flat-keyed file shown in the non-root cgroups.
+        The sgx_epc resource will show the current active EPC memory
+        usage of the cgroup and its descendants. EPC pages that are
+        swapped out to backing RAM are not included in the current count.
+
+  misc.max
+        A read-write single value file which exists on non-root
+        cgroups. The sgx_epc resource will show the EPC usage
+        hard limit. The default is "max".
+
+        If a cgroup's EPC usage reaches this limit, EPC allocations,
+        e.g. for page fault handling, will be blocked until EPC can
+        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
+        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
+
+  misc.events
+        A read-write flat-keyed file which exists on non-root cgroups.
+        Writes to the file reset the event counters to zero.  A value
+        change in this file generates a file modified event.
+
+          max
+                The number of times the cgroup has triggered a reclaim
+                due to its EPC usage approaching (or exceeding) its max
+                EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
-- 
2.38.1

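To make the interface above concrete, here is a hedged userspace sketch
that caps a cgroup's EPC usage at 64 MiB and reads back its current
usage.  The cgroup mount point and group name are made up for the
example; only the misc.max/misc.current file names and the "sgx_epc"
key come from the documentation.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical cgroup created by the admin beforehand. */
		const char *grp = "/sys/fs/cgroup/enclaves";
		char path[128], buf[64];
		int fd, len;

		/* Limit EPC to 64 MiB; non-page-aligned values round down. */
		snprintf(path, sizeof(path), "%s/misc.max", grp);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return 1;
		len = snprintf(buf, sizeof(buf), "sgx_epc %lu\n", 64UL << 20);
		if (write(fd, buf, len) != len)
			perror("write misc.max");
		close(fd);

		/* Read back the group's current EPC usage (in bytes). */
		snprintf(path, sizeof(path), "%s/misc.current", grp);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return 1;
		len = read(fd, buf, sizeof(buf) - 1);
		if (len > 0) {
			buf[len] = '\0';
			printf("%s", buf);
		}
		close(fd);
		return 0;
	}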

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 18:36 ` [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages() Kristen Carlson Accardi
@ 2022-12-02 21:33   ` Dave Hansen
  2022-12-02 21:37     ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:33 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In order to avoid repetition of cond_resched() in ksgxd() and
> sgx_alloc_epc_page(), move the invocation of post-reclaim cond_resched()
> inside sgx_reclaim_pages(). Except in the case of sgx_reclaim_direct(),
> sgx_reclaim_pages() is always called in a loop and is always followed
> by a call to cond_resched().  This will hold true for the EPC cgroup
> as well, which adds even more calls to sgx_reclaim_pages() and thus
> cond_resched(). Calls to sgx_reclaim_direct() may be performance
> sensitive. Allow sgx_reclaim_direct() to avoid the cond_resched()
> call by moving the original sgx_reclaim_pages() call to
> __sgx_reclaim_pages() and then have sgx_reclaim_pages() become a
> wrapper around that call with a cond_resched().
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 160c8dbee0ab..ffce6fc70a1f 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -static void sgx_reclaim_pages(void)
> +static void __sgx_reclaim_pages(void)
>  {
>  	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>  	struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
>  	}
>  }
>  
> +static void sgx_reclaim_pages(void)
> +{
> +	__sgx_reclaim_pages();
> +	cond_resched();
> +}

Why bother with the wrapper?  Can't we just put cond_resched() in the
existing sgx_reclaim_pages()?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 18:36 ` [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Kristen Carlson Accardi
@ 2022-12-02 21:35   ` Dave Hansen
  2022-12-02 21:40     ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:35 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> When allocating new Version Array (VA) pages, pass the struct sgx_encl
> of the enclave that is allocating the page. sgx_alloc_epc_page() will
> store this value in the encl_owner field of the struct sgx_epc_page. In
> a later patch, VA pages will be placed in an unreclaimable queue,
> and then when the cgroup max limit is reached and there are no more
> reclaimable pages and the enclave must be oom killed, all the
> VA pages associated with that enclave can be uncharged and freed.

What does this have to do with the 'encl' that is being passed, though?

In other words, why is this new sgx_epc_page-to-encl mapping needed for
VA pages now, but it wasn't before?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 21:33   ` Dave Hansen
@ 2022-12-02 21:37     ` Kristen Carlson Accardi
  2022-12-02 21:45       ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 21:37 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Fri, 2022-12-02 at 13:33 -0800, Dave Hansen wrote:
> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > In order to avoid repetition of cond_resched() in ksgxd() and
> > sgx_alloc_epc_page(), move the invocation of post-reclaim
> > cond_resched()
> > inside sgx_reclaim_pages(). Except in the case of
> > sgx_reclaim_direct(),
> > sgx_reclaim_pages() is always called in a loop and is always
> > followed
> > by a call to cond_resched().  This will hold true for the EPC
> > cgroup
> > as well, which adds even more calls to sgx_reclaim_pages() and thus
> > cond_resched(). Calls to sgx_reclaim_direct() may be performance
> > sensitive. Allow sgx_reclaim_direct() to avoid the cond_resched()
> > call by moving the original sgx_reclaim_pages() call to
> > __sgx_reclaim_pages() and then have sgx_reclaim_pages() become a
> > wrapper around that call with a cond_resched().
> > 
> > Signed-off-by: Sean Christopherson
> > <sean.j.christopherson@intel.com>
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > b/arch/x86/kernel/cpu/sgx/main.c
> > index 160c8dbee0ab..ffce6fc70a1f 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct
> > sgx_epc_page *epc_page,
> >   * problematic as it would increase the lock contention too much,
> > which would
> >   * halt forward progress.
> >   */
> > -static void sgx_reclaim_pages(void)
> > +static void __sgx_reclaim_pages(void)
> >  {
> >         struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> >         struct sgx_backing backing[SGX_NR_TO_SCAN];
> > @@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
> >         }
> >  }
> >  
> > +static void sgx_reclaim_pages(void)
> > +{
> > +       __sgx_reclaim_pages();
> > +       cond_resched();
> > +}
> 
> Why bother with the wrapper?  Can't we just put cond_resched() in the
> existing sgx_reclaim_pages()?

Because sgx_reclaim_direct() needs to call sgx_reclaim_pages() but not
do the cond_resched(). It was this or add a boolean or something to let
callers opt out of the resched.
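
For reference, a minimal sketch of the boolean alternative mentioned
above; purely illustrative, since the posted patch keeps the wrapper
instead:

	/* Hypothetical flag-based variant instead of the wrapper. */
	static void sgx_reclaim_pages(bool can_resched)
	{
		__sgx_reclaim_pages();	/* i.e. the existing reclaim logic */
		if (can_resched)
			cond_resched();
	}

	/* sgx_reclaim_direct() would then opt out explicitly: */
	void sgx_reclaim_direct(void)
	{
		if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
			sgx_reclaim_pages(false);
	}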


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2022-12-02 18:36 ` [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Kristen Carlson Accardi
@ 2022-12-02 21:39   ` Dave Hansen
  2022-12-08 15:31   ` Jarkko Sakkinen
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:39 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> Introduce a data structure to wrap the existing reclaimable list
> and its spinlock in a struct to minimize the code changes needed
> to handle multiple LRUs as well as reclaimable and non-reclaimable
> lists, both of which will be introduced and used by SGX EPC cgroups.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>

Tiny nits: Let's also allude to the fact that this doesn't do anything
with the new helpers or structures for now.

I also think it's probably a sane idea to mention that the core VM also
has parallel LRU lists for cgroups.

> +static inline struct sgx_epc_page *
> +sgx_epc_pop_reclaimable(struct sgx_epc_lru_lists *lrus)
> +{
> +	return __sgx_epc_page_list_pop(&(lrus)->reclaimable);
> +}

Are those '(lrus)' parentheses doing anything useful?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 21:35   ` Dave Hansen
@ 2022-12-02 21:40     ` Kristen Carlson Accardi
  2022-12-02 21:48       ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 21:40 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Fri, 2022-12-02 at 13:35 -0800, Dave Hansen wrote:
> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> > When allocating new Version Array (VA) pages, pass the struct
> > sgx_encl
> > of the enclave that is allocating the page. sgx_alloc_epc_page()
> > will
> > store this value in the encl_owner field of the struct
> > sgx_epc_page. In
> > a later patch, VA pages will be placed in an unreclaimable queue,
> > and then when the cgroup max limit is reached and there are no more
> > reclaimable pages and the enclave must be oom killed, all the
> > VA pages associated with that enclave can be uncharged and freed.
> 
> What does this have to do with the 'encl' that is being passed,
> though?
> 
> In other words, why is this new sgx_epc_page-to-encl mapping needed
> for
> VA pages now, but it wasn't before?

When we OOM kill an enclave, we want to get rid of all the associated
VA pages too. Prior to this patch, there wasn't a way to easily get the
VA pages associated with an enclave.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2022-12-02 18:36 ` [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Kristen Carlson Accardi
@ 2022-12-02 21:43   ` Dave Hansen
  2022-12-02 21:51     ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:43 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> +	spin_lock(&sgx_global_lru.lock);
>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> -		if (list_empty(&sgx_active_page_list))
> +		epc_page = sgx_epc_pop_reclaimable(&sgx_global_lru);
> +		if (!epc_page)
>  			break;

One other nit about the structure of the patches: This introduced *both*
reclaimable and unreclaimable list_heads.  But, it has zero use for the
unreclaimable ones during the refactoring here.  I probably would have
left out the 'unreclaimable' bits for now.

BTW, this is a nice sign:

>  arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
>  1 file changed, 19 insertions(+), 20 deletions(-)



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 21:37     ` Kristen Carlson Accardi
@ 2022-12-02 21:45       ` Dave Hansen
  2022-12-02 22:17         ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:45 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 13:37, Kristen Carlson Accardi wrote:
>>> +static void sgx_reclaim_pages(void)
>>> +{
>>> +       __sgx_reclaim_pages();
>>> +       cond_resched();
>>> +}
>> Why bother with the wrapper?  Can't we just put cond_resched() in the
>> existing sgx_reclaim_pages()?
> Because sgx_reclaim_direct() needs to call sgx_reclaim_pages() but not
> do the cond_resched(). It was this or add a boolean or something to let
> callers opt out of the resched.

Is there a reason sgx_reclaim_direct() *can't* or shouldn't call
cond_resched()?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 21:40     ` Kristen Carlson Accardi
@ 2022-12-02 21:48       ` Dave Hansen
  2022-12-02 22:35         ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 21:48 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 13:40, Kristen Carlson Accardi wrote:
> On Fri, 2022-12-02 at 13:35 -0800, Dave Hansen wrote:
>> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
>>> When allocating new Version Array (VA) pages, pass the struct
>>> sgx_encl
>>> of the enclave that is allocating the page. sgx_alloc_epc_page()
>>> will
>>> store this value in the encl_owner field of the struct
>>> sgx_epc_page. In
>>> a later patch, VA pages will be placed in an unreclaimable queue,
>>> and then when the cgroup max limit is reached and there are no more
>>> reclaimable pages and the enclave must be oom killed, all the
>>> VA pages associated with that enclave can be uncharged and freed.
>> What does this have to do with the 'encl' that is being passed,
>> though?
>>
>> In other words, why is this new sgx_epc_page-to-encl mapping needed
>> for
>> VA pages now, but it wasn't before?
> When we OOM kill an enclave, we want to get rid of all the associated
> VA pages too. Prior to this patch, there wasn't a way to easily get the
> VA pages associated with an enclave.

Given an enclave, we have encl->va_pages to look up all the VA pages.
Also, this patch's code allows you to go from a va page to an enclave.
That seems like it's going the other direction from what an OOM-kill
would need to do.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2022-12-02 21:43   ` Dave Hansen
@ 2022-12-02 21:51     ` Kristen Carlson Accardi
  2022-12-02 22:10       ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 21:51 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Fri, 2022-12-02 at 13:43 -0800, Dave Hansen wrote:
> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> > +       spin_lock(&sgx_global_lru.lock);
> >         for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> > -               if (list_empty(&sgx_active_page_list))
> > +               epc_page =
> > sgx_epc_pop_reclaimable(&sgx_global_lru);
> > +               if (!epc_page)
> >                         break;
> 
> One other nit about the structure of the patches: This introduced
> *both*
> reclaimable and unreclaimable list_heads.  But, it has zero use for
> the
> unreclaimable ones during the refactoring here.  I probably would
> have
> left out the 'unreclaimable' bits for now.

I know - and originally the addition of unreclaimable was added later,
but when I posted the RFC I felt there was some misunderstanding about
what this data structure was and how it would be used because the
addition of the unreclaimable bits came later. So I stuck both lists in
one so it'd be a better view of what the data structure would look
like.

> 
> BTW, this is a nice sign:
> 
> >  arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-------------
> > ----
> >  1 file changed, 19 insertions(+), 20 deletions(-)
> 
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2022-12-02 21:51     ` Kristen Carlson Accardi
@ 2022-12-02 22:10       ` Dave Hansen
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:10 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 13:51, Kristen Carlson Accardi wrote:
> I know - and originally the addition of unreclaimable was added later,
> but when I posted the RFC I felt there was some misunderstanding about
> what this data structure was and how it would be used because the
> addition of the unreclaimable bits came later. So I stuck both lists in
> one so it'd be a better view of what the data structure would look
> like.

You're not insane for thinking that.

But, it's really OK to introduce an abstraction that *looks* silly on
its face at first.  You can easily just make up for it by saying:

	struct silly_abstraction {
		struct list_head list;
	}

	Oh, boy does my structure look silly.  It's a structure with a
	single list_head.  Why oh why would I do something silly like
	that?  Well, for now, the code has but one list.  Soon, I'll
	add a whole smorgasbord of lists.  Bear with me for now.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists
  2022-12-02 18:36 ` [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists Kristen Carlson Accardi
@ 2022-12-02 22:13   ` Dave Hansen
  2022-12-02 22:28     ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:13 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> Replace functions sgx_mark_page_reclaimable() and
> sgx_unmark_page_reclaimable() with sgx_record_epc_page() and
> sgx_drop_epc_page(). sgx_record_epc_page() will add the epc_page
> to the correct "reclaimable" or "unreclaimable" list in the
> sgx_epc_lru_lists struct. sgx_drop_epc_page() will delete the page
> from the LRU list. Tracking pages that are not tracked by
> the reclaimer in the sgx_epc_lru_lists "unreclaimable" list allows
> an OOM event to cause all the pages in use by an enclave to be freed,
> regardless of whether they were reclaimable pages or not.

This might be more a comment about Sean's stuff, but could you please
start using paragraphs in these changelogs?

Also, on the content, I really prefer that patches start off talking in
English as much as possible and not just talk about the code.

	Right now, SGX has a single LRU list.  The code is transitioning
	over to use multiple LRU lists.

I'd also prefer that _this_ patch do the:

> -	sgx_mark_page_reclaimable(entry->epc_page);
> +	sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);

bits and then *another* patch do the unreclaimable side.  This patch
could be a straight replacement which is easy to audit.  The
unreclaimable one needs more thought.

I also think this ends up looking a bit weird:

> -	sgx_epc_push_reclaimable(&sgx_global_lru, page);
> +	WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	page->flags |= flags;
> +	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> +		sgx_epc_push_reclaimable(&sgx_global_lru, page);
> +	else
> +		sgx_epc_push_unreclaimable(&sgx_global_lru, page);
>  	spin_unlock(&sgx_global_lru.lock);
>  }

I think that would be better with a single "push" helper and then let
the callers specify the list:

	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
		sgx_lru_push(&sgx_global_lru.reclaimable, page);
	else
		sgx_lru_push(&sgx_global_lru.unreclaimable, page);

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  2022-12-02 18:36 ` [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages Kristen Carlson Accardi
@ 2022-12-02 22:15   ` Dave Hansen
  2022-12-08 15:46   ` Jarkko Sakkinen
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:15 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> When selecting pages to be reclaimed from the page pool (sgx_global_lru),
> the list of reclaimable pages is walked, and any page that is both
> reclaimable and not in the process of being freed is added to a list of
> potential candidates to be reclaimed. After that, this separate list is
> further examined and may or may not ultimately be reclaimed. In order
> to prevent this page from being removed from the sgx_epc_lru_lists
> struct in a separate thread by sgx_drop_epc_page(), keep track of
> whether the EPC page is in the middle of being reclaimed with
> the addition of a RECLAIM_IN_PROGRESS flag, and do not delete the page
> off the LRU in sgx_drop_epc_page() if it has not yet finished being
> reclaimed.

This never really comes out and tells us what problem is being addressed.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 21:45       ` Dave Hansen
@ 2022-12-02 22:17         ` Kristen Carlson Accardi
  2022-12-02 22:37           ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-02 22:17 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Fri, 2022-12-02 at 13:45 -0800, Dave Hansen wrote:
> On 12/2/22 13:37, Kristen Carlson Accardi wrote:
> > > > +static void sgx_reclaim_pages(void)
> > > > +{
> > > > +       __sgx_reclaim_pages();
> > > > +       cond_resched();
> > > > +}
> > > Why bother with the wrapper?  Can't we just put cond_resched() in
> > > the
> > > existing sgx_reclaim_pages()?
> > Because sgx_reclaim_direct() needs to call sgx_reclaim_pages() but
> > not
> > do the cond_resched(). It was this or add a boolean or something to
> > let
> > callers opt out of the resched.
> 
> Is there a reason sgx_reclaim_direct() *can't* or shouldn't call
> cond_resched()?

Yes, it is due to performance concerns. It is explained most succinctly
by Reinette here:

https://lore.kernel.org/linux-sgx/a4eb5ab0-bf83-17a4-8bc0-a90aaf438a8e@intel.com/


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists
  2022-12-02 22:13   ` Dave Hansen
@ 2022-12-02 22:28     ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2022-12-02 22:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On Fri, Dec 02, 2022, Dave Hansen wrote:
> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> > Replace functions sgx_mark_page_reclaimable() and
> > sgx_unmark_page_reclaimable() with sgx_record_epc_page() and
> > sgx_drop_epc_page(). sgx_record_epc_page() wil add the epc_page
> > to the correct "reclaimable" or "unreclaimable" list in the
> > sgx_epc_lru_lists struct. sgx_drop_epc_page() will delete the page
> > from the LRU list. Tracking pages that are not tracked by
> > the reclaimer in the sgx_epc_lru_lists "unreclaimable" list allows
> > an OOM event to cause all the pages in use by an enclave to be freed,
> > regardless of whether they were reclaimable pages or not.
> 
> This might be more a comment about Sean's stuff

Anything with a single space after a period wasn't written by me, I'm a devout
believer of two spaces :-)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  2022-12-02 18:36 ` [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim Kristen Carlson Accardi
@ 2022-12-02 22:33   ` Dave Hansen
  2022-12-05 16:33     ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:33 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Change sgx_reclaim_pages() to use a list rather than an array for
> storing the epc_pages which will be reclaimed. This change is needed
> to transition to the LRU implementation for EPC cgroup support.
> 
> This change requires keeping track of whether newly recorded
> EPC pages are pages for VA Arrays, or for Enclave data. In addition,
> helper functions are added to move pages from one list to another and
> enforce a consistent queue like behavior for the LRU lists.

More changelog nit: Please use imperative voice, not passive voice.
Move from:

	In addition, helper functions are added

to:

	In addition, add helper functions

> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 4683da9ef4f1..9ee306ac2a8e 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -252,7 +252,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  		epc_page = sgx_encl_eldu(&encl->secs, NULL);
>  		if (IS_ERR(epc_page))
>  			return ERR_CAST(epc_page);
> -		sgx_record_epc_page(epc_page, 0);
> +		sgx_record_epc_page(epc_page, SGX_EPC_PAGE_ENCLAVE);
>  	}

This is one of those patches where the first hunk seems like it is
entirely disconnected from what the changelog made me expect I would see.

I don't see sgx_reclaim_pages(), or lists or arrays.

If you need to pass additional data down into a function, then do *that*
in a separate patch.

I'm glad it eventually got fixed up, but I don't really ever like to see
bare integers that don't have obvious meaning:

	sgx_record_epc_page(epc_page, 0);

Even if you had:

#define SGX_EPC_PAGE_RECLAIMER_UNTRACKED 0

	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_UNTRACKED);

makes a *LOT* of sense compared to other callers that do

	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);

>  	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
> @@ -260,7 +260,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  		return ERR_CAST(epc_page);
>  
>  	encl->secs_child_cnt++;
> -	sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	sgx_record_epc_page(entry->epc_page,
> +			    (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
>  
>  	return entry;
>  }
> @@ -1221,7 +1222,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
>  		sgx_encl_free_epc_page(epc_page);
>  		return ERR_PTR(-EFAULT);
>  	}
> -	sgx_record_epc_page(epc_page, 0);
> +	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_VERSION_ARRAY);
>  
>  	return epc_page;
>  }
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index aca80a3f38a1..c3a9bffbc37e 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -114,7 +114,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  	encl->attributes = secs->attributes;
>  	encl->attributes_mask = SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | SGX_ATTR_KSS;
>  
> -	sgx_record_epc_page(encl->secs.epc_page, 0);
> +	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_ENCLAVE);
>  
>  	/* Set only after completion, as encl->lock has not been taken. */
>  	set_bit(SGX_ENCL_CREATED, &encl->flags);
> @@ -325,7 +325,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>  			goto err_out;
>  	}
>  
> -	sgx_record_epc_page(encl_page->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	sgx_record_epc_page(encl_page->epc_page,
> +			    (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
>  	return ret;
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index bad72498b0a7..83aaf5cea7b9 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -288,37 +288,43 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   */
>  static void __sgx_reclaim_pages(void)
>  {
> -	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>  	struct sgx_backing backing[SGX_NR_TO_SCAN];
> +	struct sgx_epc_page *epc_page, *tmp;
>  	struct sgx_encl_page *encl_page;
> -	struct sgx_epc_page *epc_page;
>  	pgoff_t page_index;
> -	int cnt = 0;
> +	LIST_HEAD(iso);
>  	int ret;
>  	int i;
>  
>  	spin_lock(&sgx_global_lru.lock);
>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> -		epc_page = sgx_epc_pop_reclaimable(&sgx_global_lru);
> +		epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
>  		if (!epc_page)
>  			break;
>  
>  		encl_page = epc_page->encl_owner;
>  
> +		if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
> +			continue;
> +
>  		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>  			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> -			chunk[cnt++] = epc_page;
> +			list_move_tail(&epc_page->list, &iso);
>  		} else {
> -			/* The owner is freeing the page. No need to add the
> -			 * page back to the list of reclaimable pages.
> +			/* The owner is freeing the page, remove it from the
> +			 * LRU list
>  			 */
>  			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +			list_del_init(&epc_page->list);
>  		}
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> +	if (list_empty(&iso))
> +		return;
> +
> +	i = 0;
> +	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->encl_owner;
>  
>  		if (!sgx_reclaimer_age(epc_page))
> @@ -333,6 +339,7 @@ static void __sgx_reclaim_pages(void)
>  			goto skip;
>  		}
>  
> +		i++;
>  		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
>  		mutex_unlock(&encl_page->encl->lock);
>  		continue;
> @@ -340,31 +347,25 @@ static void __sgx_reclaim_pages(void)
>  skip:
>  		spin_lock(&sgx_global_lru.lock);
>  		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> -		sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
> +		sgx_epc_move_reclaimable(&sgx_global_lru, epc_page);
>  		spin_unlock(&sgx_global_lru.lock);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -
> -		chunk[i] = NULL;
>  	}
>  
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> -		if (epc_page)
> -			sgx_reclaimer_block(epc_page);
> -	}
> -
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> -		if (!epc_page)
> -			continue;
> -
> +	list_for_each_entry(epc_page, &iso, list)
> +		sgx_reclaimer_block(epc_page);
> + 
> +	i = 0;
> +	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->encl_owner;
> -		sgx_reclaimer_write(epc_page, &backing[i]);
> +		sgx_reclaimer_write(epc_page, &backing[i++]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  		epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> -				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> +				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
> +				     SGX_EPC_PAGE_ENCLAVE |
> +				     SGX_EPC_PAGE_VERSION_ARRAY);
>  
>  		sgx_free_epc_page(epc_page);
>  	}
> @@ -505,6 +506,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  /**
>   * sgx_record_epc_page() - Add a page to the LRU tracking
>   * @page:	EPC page
> + * @flags:	Reclaim flags for the page.
>   *
>   * Mark a page with the specified flags and add it to the appropriate
>   * (un)reclaimable list.
> @@ -535,18 +537,19 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> -		/* The page is being reclaimed. */
> -		if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +	if ((page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) &&
> +	    (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
>  	}
>  	list_del(&page->list);
>  	spin_unlock(&sgx_global_lru.lock);
>  
> +	page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> +			 SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
> +			 SGX_EPC_PAGE_ENCLAVE |
> +			 SGX_EPC_PAGE_VERSION_ARRAY);
> +
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 37d66bc6ca27..ec8d567cd975 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -32,6 +32,8 @@
>  #define SGX_EPC_PAGE_KVM_GUEST		BIT(2)
>  /* page flag to indicate reclaim is in progress */
>  #define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
> +#define SGX_EPC_PAGE_ENCLAVE		BIT(4)
> +#define SGX_EPC_PAGE_VERSION_ARRAY	BIT(5)

Could you please spend some time to clearly document what each bit means?

> +static inline void __sgx_epc_page_list_move(struct list_head *list, struct sgx_epc_page *page)
> +{
> +	list_move_tail(&page->list, list);
> +}

I'm not sure I get the point of a helper like this.  Why not just have
the caller call list_move() directly?

>  /*
>   * Must be called with queue lock acquired
>   */
> @@ -157,6 +167,38 @@ static inline void sgx_epc_push_unreclaimable(struct sgx_epc_lru_lists *lrus,
>  	__sgx_epc_page_list_push(&(lrus)->unreclaimable, page);
>  }
>  
> +/*
> + * Must be called with queue lock acquired
> + */
> +static inline struct sgx_epc_page * __sgx_epc_page_list_peek(struct list_head *list)
> +{
> +	struct sgx_epc_page *epc_page;
> +
> +	if (list_empty(list))
> +		return NULL;
> +
> +	epc_page = list_first_entry(list, struct sgx_epc_page, list);
> +	return epc_page;
> +}

list_first_entry_or_null() perhaps?
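
For reference, the suggested simplification would look something like
this sketch:

	static inline struct sgx_epc_page *__sgx_epc_page_list_peek(struct list_head *list)
	{
		return list_first_entry_or_null(list, struct sgx_epc_page, list);
	}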

> +static inline struct sgx_epc_page *
> +sgx_epc_peek_reclaimable(struct sgx_epc_lru_lists *lrus)
> +{
> +	return __sgx_epc_page_list_peek(&(lrus)->reclaimable);
> +}
> +
> +static inline void sgx_epc_move_reclaimable(struct sgx_epc_lru_lists *lru,
> +					    struct sgx_epc_page *page)
> +{
> +	__sgx_epc_page_list_move(&(lru)->reclaimable, page);
> +}
> +
> +static inline struct sgx_epc_page *
> +sgx_epc_peek_unreclaimable(struct sgx_epc_lru_lists *lrus)
> +{
> +	return __sgx_epc_page_list_peek(&(lrus)->unreclaimable);
> +}

In general, I'm not becoming more fond of these helpers as the series
goes along.  My worry is that they're an abstraction where we don't
*really* need one.  I don't see them growing much functionality as the
series goes along.

I'll reserve judgement until the end though.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 21:48       ` Dave Hansen
@ 2022-12-02 22:35         ` Sean Christopherson
  2022-12-02 22:47           ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Sean Christopherson @ 2022-12-02 22:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On Fri, Dec 02, 2022, Dave Hansen wrote:
> On 12/2/22 13:40, Kristen Carlson Accardi wrote:
> > On Fri, 2022-12-02 at 13:35 -0800, Dave Hansen wrote:
> >> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> >>> When allocating new Version Array (VA) pages, pass the struct
> >>> sgx_encl
> >>> of the enclave that is allocating the page. sgx_alloc_epc_page()
> >>> will
> >>> store this value in the encl_owner field of the struct
> >>> sgx_epc_page. In
> >>> a later patch, VA pages will be placed in an unreclaimable queue,
> >>> and then when the cgroup max limit is reached and there are no more
> >>> reclaimable pages and the enclave must be oom killed, all the
> >>> VA pages associated with that enclave can be uncharged and freed.
> >> What does this have to do with the 'encl' that is being passed,
> >> though?
> >>
> >> In other words, why is this new sgx_epc_page-to-encl mapping needed
> >> for
> >> VA pages now, but it wasn't before?
> > When we OOM kill an enclave, we want to get rid of all the associated
> > VA pages too. Prior to this patch, there wasn't a way to easily get the
> > VA pages associated with an enclave.
> 
> Given an enclave, we have encl->va_pages to look up all the VA pages.
> Also, this patch's code allows you to go from a va page to an enclave.

Yep.

> That seems like it's going the other direction from what an OOM-kill
> would need to do.

Providing a backpointer from a VA page to its enclave allows OOM-killing the enclave
if its cgroup is over the limit but there are no reclaimable pages for said cgroup
(for SGX's definition of "reclaimable").  I.e. if all of an enclave's "regular"
pages have been swapped out, the only thing left resident in the EPC will be the
enclave's VA pages, which are not reclaimable in the kernel's current SGX
implementation.
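
A hedged sketch of the OOM path this backpointer enables.  The
unreclaimable list, sgx_epc_peek_unreclaimable() and the encl union
member come from the patchset; sgx_oom_kill_enclave() is a hypothetical
stand-in for the actual zap-and-free logic:

	static bool sgx_oom_one(struct sgx_epc_lru_lists *lrus)
	{
		struct sgx_epc_page *epc_page;
		struct sgx_encl *encl;

		spin_lock(&lrus->lock);
		/* Often only VA pages remain once everything is swapped out. */
		epc_page = sgx_epc_peek_unreclaimable(lrus);
		if (!epc_page) {
			spin_unlock(&lrus->lock);
			return false;
		}

		/* The backpointer stored at allocation makes the kill possible. */
		encl = epc_page->encl;
		if (!kref_get_unless_zero(&encl->refcount)) {
			spin_unlock(&lrus->lock);
			return false;
		}
		spin_unlock(&lrus->lock);

		sgx_oom_kill_enclave(encl);	/* hypothetical zap + free */
		kref_put(&encl->refcount, sgx_encl_release);
		return true;
	}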

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  2022-12-02 22:17         ` Kristen Carlson Accardi
@ 2022-12-02 22:37           ` Dave Hansen
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:37 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/2/22 14:17, Kristen Carlson Accardi wrote:
> On Fri, 2022-12-02 at 13:45 -0800, Dave Hansen wrote:
>> On 12/2/22 13:37, Kristen Carlson Accardi wrote:
>>>>> +static void sgx_reclaim_pages(void)
>>>>> +{
>>>>> +       __sgx_reclaim_pages();
>>>>> +       cond_resched();
>>>>> +}
>>>> Why bother with the wrapper?  Can't we just put cond_resched() in
>>>> the
>>>> existing sgx_reclaim_pages()?
>>> Because sgx_reclaim_direct() needs to call sgx_reclaim_pages()
>>> but not do the cond_resched(). It was this or add a boolean or
>>> something to let caller's opt out of the resched.
>>
>> Is there a reason sgx_reclaim_direct() *can't* or shouldn't call
>> cond_resched()?
> 
> Yes, it is due to performance concerns. It is explained most succinctly
> by Reinette here:
> 
> https://lore.kernel.org/linux-sgx/a4eb5ab0-bf83-17a4-8bc0-a90aaf438a8e@intel.com/

I think I'd much rather have 3 cond_resched()s in the code that
effectively self-document than one __something() in there that's a bit
of a mystery.

Everyone knows what cond_resched() means.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 22:35         ` Sean Christopherson
@ 2022-12-02 22:47           ` Dave Hansen
  2022-12-02 22:49             ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-02 22:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On 12/2/22 14:35, Sean Christopherson wrote:
>> That seems like it's going the other direction from what an OOM-kill
>> would need to do.
> Providing a backpointer from a VA page to its enclave allows OOM-killing the enclave
> if its cgroup is over the limit but there are no reclaimable pages for said cgroup
> (for SGX's definition of "reclaimable").  I.e. if all of an enclave's "regular"
> pages have been swapped out, the only thing left resident in the EPC will be the
> enclave's VA pages, which are not reclaimable in the kernel's current SGX
> implementation.

Ooooooooooooooooooooh.  I'm a dummy.


So, we've got a cgroup.  It's in OOM-kill mode and we're looking at the
*cgroup* LRU lists.  We've done everything we can to the enclave and
swapped everything out that we can.  All we're left with are these
crummy VA pages on the LRU (or equally crummy pages).  We want to
reclaim them but can't swap VA pages.  Our only recourse is to go to the
enclave and kill *it*.

Right now, we can easily find an enclave's VA pages and free them.  We
do that all the time when freeing whole enclaves.  But, what we can't
easily do is find an enclave given a VA page.

A reverse pointer from VA page back to enclave allows the VA page's
enclave to be located and efficiently killed.

Right?

Could we add that context to the changelog, please?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2022-12-02 22:47           ` Dave Hansen
@ 2022-12-02 22:49             ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2022-12-02 22:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On Fri, Dec 02, 2022, Dave Hansen wrote:
> On 12/2/22 14:35, Sean Christopherson wrote:
> >> That seems like it's going the other direction from what an OOM-kill
> >> would need to do.
> > Providing a backpointer from a VA page to its enclave allows OOM-killing the enclave
> > if its cgroup is over the limit but there are no reclaimable pages for said cgroup
> > (for SGX's definition of "reclaimable").  I.e. if all of an enclave's "regular"
> > pages have been swapped out, the only thing left resident in the EPC will be the
> > enclave's VA pages, which are not reclaimable in the kernel's current SGX
> > implementation.
> 
> Ooooooooooooooooooooh.  I'm a dummy.
> 
> 
> So, we've got a cgroup.  It's in OOM-kill mode and we're looking at the
> *cgroup* LRU lists.  We've done everything we can to the enclave and
> swapped everything out that we can.  All we're left with are these
> crummy VA pages on the LRU (or equally crummy pages).  We want to
> reclaim them but can't swap VA pages.  Our only recourse is to go to the
> enclave and kill *it*.
> 
> Right now, we can easily find an enclave's VA pages and free them.  We
> do that all the time when freeing whole enclaves.  But, what we can't
> easily do is find an enclave given a VA page.
> 
> A reverse pointer from VA page back to enclave allows the VA page's
> enclave to be located and efficiently killed.
> 
> Right?

Yep, exactly.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  2022-12-02 22:33   ` Dave Hansen
@ 2022-12-05 16:33     ` Kristen Carlson Accardi
  2022-12-05 17:03       ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-05 16:33 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Fri, 2022-12-02 at 14:33 -0800, Dave Hansen wrote:
> On 12/2/22 10:36, Kristen Carlson Accardi wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Change sgx_reclaim_pages() to use a list rather than an array for
> > storing the epc_pages which will be reclaimed. This change is needed
> > to transition to the LRU implementation for EPC cgroup support.
> > 
> > This change requires keeping track of whether newly recorded
> > EPC pages are pages for VA arrays, or for enclave data. In addition,
> > helper functions are added to move pages from one list to another and
> > enforce a consistent queue-like behavior for the LRU lists.
> 
> More changelog nit: Please use imperative voice, not passive voice.
> Move from:
> 
>         In addition, helper functions are added
> 
> to:
> 
>         In addition, add helper functions
> 
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 4683da9ef4f1..9ee306ac2a8e 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -252,7 +252,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
> >                 epc_page = sgx_encl_eldu(&encl->secs, NULL);
> >                 if (IS_ERR(epc_page))
> >                         return ERR_CAST(epc_page);
> > -               sgx_record_epc_page(epc_page, 0);
> > +               sgx_record_epc_page(epc_page, SGX_EPC_PAGE_ENCLAVE);
> >         }
> 
> This is one of those patches where the first hunk seems like it is
> entirely disconnected from what the changelog made me expect I would
> see.
> 
> I don't see sgx_reclaim_pages(), or lists or arrays.
> 
> If you need to pass additional data down into a function, then do
> *that*
> in a separate patch.
> 
> I'm glad it eventually got fixed up, but I don't really ever like to
> see
> bare integers that don't have obvious meaning:
> 
>         sgx_record_epc_page(epc_page, 0);
> 
> Even if you had:
> 
> #define SGX_EPC_PAGE_RECLAIMER_UNTRACKED 0
> 
>         sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
> 
> makes a *LOT* of sense compared to other callers that do
> 
>         sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> 
> >         epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
> > @@ -260,7 +260,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
> >                 return ERR_CAST(epc_page);
> >  
> >         encl->secs_child_cnt++;
> > -       sgx_record_epc_page(entry->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > +       sgx_record_epc_page(entry->epc_page,
> > +                           (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
> >  
> >         return entry;
> >  }
> > @@ -1221,7 +1222,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
> >                 sgx_encl_free_epc_page(epc_page);
> >                 return ERR_PTR(-EFAULT);
> >         }
> > -       sgx_record_epc_page(epc_page, 0);
> > +       sgx_record_epc_page(epc_page, SGX_EPC_PAGE_VERSION_ARRAY);
> >  
> >         return epc_page;
> >  }
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index aca80a3f38a1..c3a9bffbc37e 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -114,7 +114,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
> >         encl->attributes = secs->attributes;
> >         encl->attributes_mask = SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | SGX_ATTR_KSS;
> >  
> > -       sgx_record_epc_page(encl->secs.epc_page, 0);
> > +       sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_PAGE_ENCLAVE);
> >  
> >         /* Set only after completion, as encl->lock has not been taken. */
> >         set_bit(SGX_ENCL_CREATED, &encl->flags);
> > @@ -325,7 +325,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
> >                         goto err_out;
> >         }
> >  
> > -       sgx_record_epc_page(encl_page->epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > +       sgx_record_epc_page(encl_page->epc_page,
> > +                           (SGX_EPC_PAGE_ENCLAVE | SGX_EPC_PAGE_RECLAIMER_TRACKED));
> >         mutex_unlock(&encl->lock);
> >         mmap_read_unlock(current->mm);
> >         return ret;
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index bad72498b0a7..83aaf5cea7b9 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -288,37 +288,43 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> >   */
> >  static void __sgx_reclaim_pages(void)
> >  {
> > -       struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> >         struct sgx_backing backing[SGX_NR_TO_SCAN];
> > +       struct sgx_epc_page *epc_page, *tmp;
> >         struct sgx_encl_page *encl_page;
> > -       struct sgx_epc_page *epc_page;
> >         pgoff_t page_index;
> > -       int cnt = 0;
> > +       LIST_HEAD(iso);
> >         int ret;
> >         int i;
> >  
> >         spin_lock(&sgx_global_lru.lock);
> >         for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> > -               epc_page = sgx_epc_pop_reclaimable(&sgx_global_lru);
> > +               epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
> >                 if (!epc_page)
> >                         break;
> >  
> >                 encl_page = epc_page->encl_owner;
> >  
> > +               if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
> > +                       continue;
> > +
> >                 if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> >                         epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> > -                       chunk[cnt++] = epc_page;
> > +                       list_move_tail(&epc_page->list, &iso);
> >                 } else {
> > -                       /* The owner is freeing the page. No need to add the
> > -                        * page back to the list of reclaimable pages.
> > +                       /* The owner is freeing the page, remove it from the
> > +                        * LRU list
> >                          */
> >                         epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> > +                       list_del_init(&epc_page->list);
> >                 }
> >         }
> >         spin_unlock(&sgx_global_lru.lock);
> >  
> > -       for (i = 0; i < cnt; i++) {
> > -               epc_page = chunk[i];
> > +       if (list_empty(&iso))
> > +               return;
> > +
> > +       i = 0;
> > +       list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> >                 encl_page = epc_page->encl_owner;
> >  
> >                 if (!sgx_reclaimer_age(epc_page))
> > @@ -333,6 +339,7 @@ static void __sgx_reclaim_pages(void)
> >                         goto skip;
> >                 }
> >  
> > +               i++;
> >                 encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
> >                 mutex_unlock(&encl_page->encl->lock);
> >                 continue;
> > @@ -340,31 +347,25 @@ static void __sgx_reclaim_pages(void)
> >  skip:
> >                 spin_lock(&sgx_global_lru.lock);
> >                 epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> > -               sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
> > +               sgx_epc_move_reclaimable(&sgx_global_lru, epc_page);
> >                 spin_unlock(&sgx_global_lru.lock);
> >  
> >                 kref_put(&encl_page->encl->refcount, sgx_encl_release);
> > -
> > -               chunk[i] = NULL;
> >         }
> >  
> > -       for (i = 0; i < cnt; i++) {
> > -               epc_page = chunk[i];
> > -               if (epc_page)
> > -                       sgx_reclaimer_block(epc_page);
> > -       }
> > -
> > -       for (i = 0; i < cnt; i++) {
> > -               epc_page = chunk[i];
> > -               if (!epc_page)
> > -                       continue;
> > -
> > +       list_for_each_entry(epc_page, &iso, list)
> > +               sgx_reclaimer_block(epc_page);
> > +
> > +       i = 0;
> > +       list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> >                 encl_page = epc_page->encl_owner;
> > -               sgx_reclaimer_write(epc_page, &backing[i]);
> > +               sgx_reclaimer_write(epc_page, &backing[i++]);
> >  
> >                 kref_put(&encl_page->encl->refcount, sgx_encl_release);
> >                 epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> > -                                    SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> > +                                    SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
> > +                                    SGX_EPC_PAGE_ENCLAVE |
> > +                                    SGX_EPC_PAGE_VERSION_ARRAY);
> >  
> >                 sgx_free_epc_page(epc_page);
> >         }
> > @@ -505,6 +506,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  /**
> >   * sgx_record_epc_page() - Add a page to the LRU tracking
> >   * @page:      EPC page
> > + * @flags:     Reclaim flags for the page.
> >   *
> >   * Mark a page with the specified flags and add it to the appropriate
> >   * (un)reclaimable list.
> > @@ -535,18 +537,19 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
> >  int sgx_drop_epc_page(struct sgx_epc_page *page)
> >  {
> >         spin_lock(&sgx_global_lru.lock);
> > -       if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> > -               /* The page is being reclaimed. */
> > -               if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
> > -                       spin_unlock(&sgx_global_lru.lock);
> > -                       return -EBUSY;
> > -               }
> > -
> > -               page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> > +       if ((page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) &&
> > +           (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)) {
> > +               spin_unlock(&sgx_global_lru.lock);
> > +               return -EBUSY;
> >         }
> >         list_del(&page->list);
> >         spin_unlock(&sgx_global_lru.lock);
> >  
> > +       page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> > +                        SGX_EPC_PAGE_RECLAIM_IN_PROGRESS |
> > +                        SGX_EPC_PAGE_ENCLAVE |
> > +                        SGX_EPC_PAGE_VERSION_ARRAY);
> > +
> >         return 0;
> >  }
> >  
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > index 37d66bc6ca27..ec8d567cd975 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -32,6 +32,8 @@
> >  #define SGX_EPC_PAGE_KVM_GUEST         BIT(2)
> >  /* page flag to indicate reclaim is in progress */
> >  #define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
> > +#define SGX_EPC_PAGE_ENCLAVE           BIT(4)
> > +#define SGX_EPC_PAGE_VERSION_ARRAY     BIT(5)
> 
> Could you please spend some time to clearly document what each bit
> means?
> 
> > +static inline void __sgx_epc_page_list_move(struct list_head *list, struct sgx_epc_page *page)
> > +{
> > +       list_move_tail(&page->list, list);
> > +}
> 
> I'm not sure I get the point of a helper like this.  Why not just
> have
> the caller call list_move() directly?
> 
> >  /*
> >   * Must be called with queue lock acquired
> >   */
> > @@ -157,6 +167,38 @@ static inline void sgx_epc_push_unreclaimable(struct sgx_epc_lru_lists *lrus,
> >         __sgx_epc_page_list_push(&(lrus)->unreclaimable, page);
> >  }
> >  
> > +/*
> > + * Must be called with queue lock acquired
> > + */
> > +static inline struct sgx_epc_page *
> > +__sgx_epc_page_list_peek(struct list_head *list)
> > +{
> > +       struct sgx_epc_page *epc_page;
> > +
> > +       if (list_empty(list))
> > +               return NULL;
> > +
> > +       epc_page = list_first_entry(list, struct sgx_epc_page, list);
> > +       return epc_page;
> > +}
> 
> list_first_entry_or_null() perhaps?
> 
> > +static inline struct sgx_epc_page *
> > +sgx_epc_peek_reclaimable(struct sgx_epc_lru_lists *lrus)
> > +{
> > +       return __sgx_epc_page_list_peek(&(lrus)->reclaimable);
> > +}
> > +
> > +static inline void sgx_epc_move_reclaimable(struct sgx_epc_lru_lists *lru,
> > +                                           struct sgx_epc_page *page)
> > +{
> > +       __sgx_epc_page_list_move(&(lru)->reclaimable, page);
> > +}
> > +
> > +static inline struct sgx_epc_page *
> > +sgx_epc_peek_unreclaimable(struct sgx_epc_lru_lists *lrus)
> > +{
> > +       return __sgx_epc_page_list_peek(&(lrus)->unreclaimable);
> > +}
> 
> In general, I'm not becoming more fond of these helpers as the series
> goes along.  My worry is that they're an abstraction where we don't
> *really* need one.  I don't see them growing much functionality as the
> series goes along.
> 
> I'll reserve judgement until the end though.
> 


The helpers were added because Jarkko requested a queue abstraction for
the sgx_epc_lru_lists data structure in the first round of reviews. The
simple one-line inlines are effectively just renames to make the queue
abstraction more obvious to the reader.
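
For example, each of them boils down to a one-liner of this shape (a
sketch based on the quoted code above):

	/* "push" spells out the queue semantics; the body is a plain list op. */
	static inline void sgx_epc_push_reclaimable(struct sgx_epc_lru_lists *lrus,
						    struct sgx_epc_page *page)
	{
		list_add_tail(&page->list, &lrus->reclaimable);
	}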


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  2022-12-05 16:33     ` Kristen Carlson Accardi
@ 2022-12-05 17:03       ` Dave Hansen
  2022-12-05 18:25         ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-05 17:03 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On 12/5/22 08:33, Kristen Carlson Accardi wrote:
> The helpers were added because Jarkko requested a queue abstraction for
> the sgx_epc_lru_lists data structure in the first round of reviews. The
> simple one-line inlines are effectively just renames to make the queue
> abstraction more obvious to the reader.

Jarkko,

Do you have any issues with zapping these helpers?  I really don't think
they add to readability.  The "reclaimable" versus "unreclaimable"
naming is patently obvious from the structure member names.  I'm not
sure what value it adds to have them in the function names too.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  2022-12-05 17:03       ` Dave Hansen
@ 2022-12-05 18:25         ` Kristen Carlson Accardi
  0 siblings, 0 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-05 18:25 UTC (permalink / raw)
  To: Dave Hansen, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: zhiquan1.li, Sean Christopherson

On Mon, 2022-12-05 at 09:03 -0800, Dave Hansen wrote:
> On 12/5/22 08:33, Kristen Carlson Accardi wrote:
> > The helpers were added because Jarkko requested a queue abstraction for
> > the sgx_epc_lru_lists data structure in the first round of reviews. The
> > simple one-line inlines are effectively just renames to make the queue
> > abstraction more obvious to the reader.
> 
> Jarkko,
> 
> Do you have any issues with zapping these helpers?  I really don't
> think
> they add to readability.  The "reclaimable" versus "unreclaimable"
> naming is patently obvious from the structure member names.  I'm not
> sure what value it adds to have them in the function names too.
> 
> 


Well, there are really two things I would want clarity on before my next
revision. One is obviously deleting the wrappers for the unreclaimable and
reclaimable pushes, etc. The other is deleting the wrappers for the list
operations (the push/pop/peek queue abstractions), and whether those are
desired.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  2022-12-02 18:36 ` [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Kristen Carlson Accardi
@ 2022-12-08  9:26   ` Jarkko Sakkinen
  2022-12-08  9:27     ` Jarkko Sakkinen
  0 siblings, 1 reply; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:26 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:44AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Modify sgx_reclaim_pages() to take a parameter that specifies the
> number of pages to scan for reclaiming. Specify a max value of
> 32, but scan 16 in the usual case. This allows the number of pages
> sgx_reclaim_pages() scans to be specified by the caller, and adjusted
> in future patches.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 25 +++++++++++++++----------
>  1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 83aaf5cea7b9..f201ca85212f 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -18,6 +18,8 @@
>  #include "encl.h"
>  #include "encls.h"
>  
> +#define SGX_MAX_NR_TO_RECLAIM	32

SGX_NR_TO_SCAN_MAX

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  2022-12-08  9:26   ` Jarkko Sakkinen
@ 2022-12-08  9:27     ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:27 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Thu, Dec 08, 2022 at 09:26:35AM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 10:36:44AM -0800, Kristen Carlson Accardi wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Modify sgx_reclaim_pages() to take a parameter that specifies the
> > number of pages to scan for reclaiming. Specify a max value of
> > 32, but scan 16 in the usual case. This allows the number of pages
> > sgx_reclaim_pages() scans to be specified by the caller, and adjusted
> > in future patches.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/main.c | 25 +++++++++++++++----------
> >  1 file changed, 15 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index 83aaf5cea7b9..f201ca85212f 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -18,6 +18,8 @@
> >  #include "encl.h"
> >  #include "encls.h"
> >  
> > +#define SGX_MAX_NR_TO_RECLAIM	32
> 
> SGX_NR_TO_SCAN_MAX

Would also deserve a descriptive comment.
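
E.g. something like this (the wording is only a suggestion):

	/*
	 * Maximum number of pages that a single call may scan and reclaim;
	 * callers can request fewer, but never more than this.
	 */
	#define SGX_NR_TO_SCAN_MAX	32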

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed
  2022-12-02 18:36 ` [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed Kristen Carlson Accardi
@ 2022-12-08  9:30   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:30 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:45AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Return the number of reclaimed pages from sgx_reclaim_pages(); the EPC
> cgroup will use the result to track the success rate of its reclaim
> calls, e.g. to escalate to a more forceful reclaiming mode if necessary.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index f201ca85212f..a4a65eadfb79 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -291,7 +291,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -static void __sgx_reclaim_pages(int nr_to_scan)
> +static int __sgx_reclaim_pages(int nr_to_scan)

Nit: I wonder if we should use ssize_t here?

If nothing else, it would document better than 'int'.
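
I.e. something like this (a sketch of the suggestion):

	/* A signed size type documents "a count, or potentially an error". */
	static ssize_t __sgx_reclaim_pages(int nr_to_scan);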

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim
  2022-12-02 18:36 ` [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim Kristen Carlson Accardi
@ 2022-12-08  9:37   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:37 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:46AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add a flag to sgx_reclaim_pages() to instruct it to ignore the age of a
> page, i.e. reclaim the page even if it's young.  The EPC cgroup will use
> the flag to enforce its limits by draining the reclaimable lists before
> resorting to other measures, e.g. forcefully reclaiming "unreclaimable"
> pages by killing enclaves.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 46 +++++++++++++++++++++-------------
>  1 file changed, 29 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index a4a65eadfb79..db96483e2e74 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -31,6 +31,10 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>   * with sgx_global_lru.lock acquired.
>   */
>  static struct sgx_epc_lru_lists sgx_global_lru;

Please separate these with an empty line.

> +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
> +{
> +	return &sgx_global_lru;
> +}

Should be named by the thing it returns, not by the type.
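
E.g. something like this (the name is only a suggestion):

	/* Returns the LRU that the page belongs to. */
	static inline struct sgx_epc_lru_lists *sgx_lru(struct sgx_epc_page *epc_page)
	{
		return &sgx_global_lru;
	}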

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs
  2022-12-02 18:36 ` [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs Kristen Carlson Accardi
@ 2022-12-08  9:42   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:42 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:47AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add sgx_can_reclaim() wrapper so that in a subsequent patch, multiple LRUs
> can be used cleanly.

Nit: Patch is the transient form of a change. Once a change has been
committed, it is a commit.

Further, you should explain why the subsequent changes require the
sgx_can_reclaim() wrapper, rather than just claiming it without argument.

Alternatively, you can consider squashing this into the subsequent
patch, which makes use of the wrapper.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2022-12-02 18:36 ` [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Kristen Carlson Accardi
@ 2022-12-08  9:46   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:46 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:48AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Expose the top-level reclaim function as sgx_reclaim_epc_pages() for use
> by the upcoming EPC cgroup, which will initiate reclaim to enforce
> changes to high/max limits.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 7 ++++---
>  arch/x86/kernel/cpu/sgx/sgx.h  | 1 +
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 96399e2016a8..c947b4ae06f3 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -281,6 +281,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  
>  /**
>   * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
> + * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
>   *
> @@ -385,7 +386,7 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
>  	return i;
>  }
>  
> -static int sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
> +int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
>  {
>  	int ret;
>  
> @@ -441,7 +442,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>  	}
>  
>  	return 0;
> @@ -624,7 +625,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>  	}
>  
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index ec8d567cd975..ce859331ddf5 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -206,6 +206,7 @@ void sgx_reclaim_direct(void);
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> +int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
>  
>  void sgx_ipi_cb(void *info);
>  
> -- 
> 2.38.1
> 

Unless there is a risk of name collision, I think this rename is
just adding unnecessary convolution to the patch set.

I would revert the rename part and just do the export.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  2022-12-02 18:36 ` [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Kristen Carlson Accardi
@ 2022-12-08  9:56   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08  9:56 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:49AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Move the isolation loop into a standalone helper, sgx_isolate_pages(),
> in preparation for existence of multiple LRUs.  Expose the helper to
> other SGX code so that it can be called from the EPC cgroup code, e.g.
> to isolate pages from a single cgroup LRU.  Exposing the isolation loop
> allows the cgroup iteration logic to be wholly encapsulated within the
> cgroup code.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 68 +++++++++++++++++++++-------------
>  arch/x86/kernel/cpu/sgx/sgx.h  |  2 +
>  2 files changed, 44 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index c947b4ae06f3..a59550fa150b 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -280,7 +280,46 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  }
>  
>  /**
> - * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
> + * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
> + * @lru:	LRU from which to reclaim
> + * @nr_to_scan:	Number of pages to scan for reclaim
> + * @dst:	Destination list to hold the isolated pages
> + */
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int *nr_to_scan,
> +			   struct list_head *dst)

Why not instead return the number of pages scanned, and pass
a plain 'int nr_to_scan'?

That would just be the more idiomatic choice.
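
I.e. roughly this shape (a sketch of the suggestion):

	/* Returns the number of pages that were actually scanned. */
	int sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int nr_to_scan,
				  struct list_head *dst);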

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events
  2022-12-02 18:36 ` [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events Kristen Carlson Accardi
@ 2022-12-08 14:53   ` Jarkko Sakkinen
  2022-12-08 15:15     ` Jarkko Sakkinen
  0 siblings, 1 reply; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 14:53 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups, Zefan Li,
	Johannes Weiner, zhiquan1.li

On Fri, Dec 02, 2022 at 10:36:51AM -0800, Kristen Carlson Accardi wrote:
> Consumers of the misc cgroup controller might need to perform separate actions
> in the event of a cgroup alloc, free or release call. In addition,
> writes to the max value may also need separate action. Add the ability
> to allow downstream users to set up callbacks for these operations, and
> call the per resource type callback when appropriate.
> 
> This code will be utilized by the SGX driver in a future patch.
> 
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>

I don't know what css is.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events
  2022-12-08 14:53   ` Jarkko Sakkinen
@ 2022-12-08 15:15     ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:15 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups, Zefan Li,
	Johannes Weiner, zhiquan1.li

On Thu, Dec 08, 2022 at 02:53:24PM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 10:36:51AM -0800, Kristen Carlson Accardi wrote:
> > Consumers of the misc cgroup controller might need to perform separate actions
> > in the event of a cgroup alloc, free or release call. In addition,
> > writes to the max value may also need separate action. Add the ability
> > to allow downstream users to set up callbacks for these operations, and
> > call the per resource type callback when appropriate.
> > 
> > This code will be utilized by the SGX driver in a future patch.
> > 
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> I don't know what css is.

Now I know, but it should be described in the commit message, i.e.
what css is an abbreviation of, and what it means in practice.
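
For reference, css is short for "cgroup subsystem state", i.e. the
per-cgroup, per-controller object that these callbacks operate on:

	struct cgroup_subsys_state *css;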

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-02 18:36 ` [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Kristen Carlson Accardi
@ 2022-12-08 15:21   ` Jarkko Sakkinen
  2022-12-09 16:05     ` Kristen Carlson Accardi
  0 siblings, 1 reply; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:21 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:50AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Introduce the OOM path for killing an enclave when the reclaimer is
> no longer able to reclaim enough EPC pages. Find a victim enclave,
> which will be an enclave with EPC pages remaining that are not
> accessible to the reclaimer ("unreclaimable"). Once a victim is
> identified, mark the enclave as OOM and zap the enclave's entire
> page range. Release all the enclave's resources except for the
> struct sgx_encl memory itself.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>

Why is this patch dependent on all 13 patches before it?

It looks like something that is orthogonal to cgroups and could
live on its own. At least it probably does not require all of
those patches, or does it?

Even without cgroups it would make sense to kill enclaves if the
reclaimer gets stuck.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage
  2022-12-02 18:36 ` [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage Kristen Carlson Accardi
@ 2022-12-08 15:23   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:23 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups, Zefan Li,
	Johannes Weiner, zhiquan1.li

"Prepare for SGX usage"?
 
What does that mean? Please make up something more descriptive.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller
  2022-12-02 18:36 ` [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller Kristen Carlson Accardi
@ 2022-12-08 15:30   ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:30 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

Perhaps "x86/sgx: Limit process EPC usage with misc cgroup controller"?

Or something more to the point than "add support".

On Fri, Dec 02, 2022 at 10:36:53AM -0800, Kristen Carlson Accardi wrote:
  
>  /**
> - * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
> + * __sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
> + * @epc_cg:		 EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
> @@ -336,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, int *nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
> +static int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
> +			  struct sgx_epc_cgroup *epc_cg)
>  {
>  	struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
>  	struct sgx_epc_page *epc_page, *tmp;
> @@ -347,7 +362,15 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
>  	int i = 0;
>  	int ret;
>  
> -	sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +        /*
> +         * If a specific cgroup is not being targetted, take from the global
> +         * list first, even when cgroups are enabled.  If there are
> +         * pages on the global LRU then they should get reclaimed asap.
> +         */
> +        if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> +                sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
>  
>  	if (list_empty(&iso))
>  		return 0;
> @@ -397,25 +420,33 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
>  				     SGX_EPC_PAGE_ENCLAVE |
>  				     SGX_EPC_PAGE_VERSION_ARRAY);
>  
> +		if (epc_page->epc_cg) {
> +			sgx_epc_cgroup_uncharge(epc_page->epc_cg);
> +			epc_page->epc_cg = NULL;
> +		}
> +
>  		sgx_free_epc_page(epc_page);
>  	}
>  	return i;
>  }

I would consider the changes to sgx_reclaim_epc_pages() as a separate patch,
perhaps squashed with the patch that does the export. And generally,
separate all internal arch/x86/kernel/cpu/sgx changes from this patch,
and leave only the cgroup bindings.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2022-12-02 18:36 ` [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Kristen Carlson Accardi
  2022-12-02 21:39   ` Dave Hansen
@ 2022-12-08 15:31   ` Jarkko Sakkinen
  2022-12-08 18:03     ` Kristen Carlson Accardi
  1 sibling, 1 reply; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:31 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:39AM -0800, Kristen Carlson Accardi wrote:
> Introduce a data structure to wrap the existing reclaimable list
> and its spinlock in a struct to minimize the code changes needed
> to handle multiple LRUs as well as reclaimable and non-reclaimable
> lists, both of which will be introduced and used by SGX EPC cgroups.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/sgx.h | 65 +++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 39cb15a8abcb..5e6d88438fae 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -90,6 +90,71 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  	return section->virt_addr + index * PAGE_SIZE;
>  }
>  
> +/*
> + * This data structure wraps a list of reclaimable EPC pages, and a list of
> + * non-reclaimable EPC pages and is used to implement a LRU policy during
> + * reclamation.
> + */
> +struct sgx_epc_lru_lists {
> +	spinlock_t lock;
> +	struct list_head reclaimable;
> +	struct list_head unreclaimable;
> +};
 
Why is this named like this, and not sgx_epc_global_lru? Are there
any other use cases?

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  2022-12-02 18:36 ` [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages Kristen Carlson Accardi
  2022-12-02 22:15   ` Dave Hansen
@ 2022-12-08 15:46   ` Jarkko Sakkinen
  2022-12-08 18:13     ` Kristen Carlson Accardi
  1 sibling, 1 reply; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-08 15:46 UTC (permalink / raw)
  To: Kristen Carlson Accardi
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Fri, Dec 02, 2022 at 10:36:42AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> When selecting pages to be reclaimed from the page pool (sgx_global_lru),
> the list of reclaimable pages is walked, and any page that is both
> reclaimable and not in the process of being freed is added to a list of
> potential candidates to be reclaimed. After that, this separate list is
> further examined and may or may not ultimately be reclaimed. In order
> to prevent this page from being removed from the sgx_epc_lru_lists
> struct in a separate thread by sgx_drop_epc_page(), keep track of
> whether the EPC page is in the middle of being reclaimed with
> the addition of a RECLAIM_IN_PROGRESS flag, and do not delete the page
> off the LRU in sgx_drop_epc_page() if it has not yet finished being
> reclaimed.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 15 ++++++++++-----
>  arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index ecd7f8e704cc..bad72498b0a7 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -305,13 +305,15 @@ static void __sgx_reclaim_pages(void)
>  
>  		encl_page = epc_page->encl_owner;
>  
> -		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> +			epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
>  			chunk[cnt++] = epc_page;
> -		else
> +		} else {
>  			/* The owner is freeing the page. No need to add the
>  			 * page back to the list of reclaimable pages.
>  			 */
>  			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		}
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -337,6 +339,7 @@ static void __sgx_reclaim_pages(void)
>  
>  skip:
>  		spin_lock(&sgx_global_lru.lock);
> +		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
>  		sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
>  		spin_unlock(&sgx_global_lru.lock);
>  
> @@ -360,7 +363,8 @@ static void __sgx_reclaim_pages(void)
>  		sgx_reclaimer_write(epc_page, &backing[i]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> +				     SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>  
>  		sgx_free_epc_page(epc_page);
>  	}
> @@ -508,7 +512,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	WARN_ON(page->flags & (SGX_EPC_PAGE_RECLAIMER_TRACKED |
> +			       SGX_EPC_PAGE_RECLAIM_IN_PROGRESS));
>  	page->flags |= flags;
>  	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
>  		sgx_epc_push_reclaimable(&sgx_global_lru, page);
> @@ -532,7 +537,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  	spin_lock(&sgx_global_lru.lock);
>  	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
>  		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> +		if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
>  			spin_unlock(&sgx_global_lru.lock);
>  			return -EBUSY;
>  		}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index ba4338b7303f..37d66bc6ca27 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -30,6 +30,8 @@
>  #define SGX_EPC_PAGE_IS_FREE		BIT(1)
>  /* Pages allocated for KVM guest */
>  #define SGX_EPC_PAGE_KVM_GUEST		BIT(2)
> +/* page flag to indicate reclaim is in progress */
> +#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
>  
>  struct sgx_epc_page {
>  	unsigned int section;
> -- 
> 2.38.1
> 

I would create:

enum sgx_epc_state {    
        SGX_EPC_STATE_READY = 0,
        SGX_EPC_STATE_RECLAIMER_TRACKED = 1,
        SGX_EPC_STATE_RECLAIM_IN_PROGRESS = 2,
};

I.e. not a bitmask, because a page should have only one state at
a time for any of this to make any sense. We have an FSM,
right?

And then allocate the 2 upper bits of the flags field to store this
information.

And it would probably make sense to have inline helper functions
for setting and getting the state that do the bit shifting and
masking shenanigans.
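
As a rough sketch (bit positions and names here are illustrative only):

	#define SGX_EPC_PAGE_STATE_SHIFT	(BITS_PER_LONG - 2)
	#define SGX_EPC_PAGE_STATE_MASK		(0x3UL << SGX_EPC_PAGE_STATE_SHIFT)

	static inline enum sgx_epc_state sgx_epc_page_state(struct sgx_epc_page *page)
	{
		return (enum sgx_epc_state)((page->flags & SGX_EPC_PAGE_STATE_MASK) >>
					    SGX_EPC_PAGE_STATE_SHIFT);
	}

	static inline void sgx_epc_page_set_state(struct sgx_epc_page *page,
						  enum sgx_epc_state state)
	{
		page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
		page->flags |= (unsigned long)state << SGX_EPC_PAGE_STATE_SHIFT;
	}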

This would be a patch preceding this one.

In this patch you should then describe, in the context of the FSM,
how an EPC page moves between these states. With that knowledge
we can then reflect on the actual code change.

The point is not to get it right, but more to have a mindset where we
can discuss right or wrong in the first place, so just do your
best and don't stress too much.

BR, Jarkko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2022-12-08 15:31   ` Jarkko Sakkinen
@ 2022-12-08 18:03     ` Kristen Carlson Accardi
  0 siblings, 0 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-08 18:03 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Thu, 2022-12-08 at 15:31 +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 10:36:39AM -0800, Kristen Carlson Accardi wrote:
> > Introduce a data structure to wrap the existing reclaimable list
> > and its spinlock in a struct to minimize the code changes needed
> > to handle multiple LRUs as well as reclaimable and non-reclaimable
> > lists, both of which will be introduced and used by SGX EPC
> > cgroups.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/sgx.h | 65 +++++++++++++++++++++++++++++++++++
> >  1 file changed, 65 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > index 39cb15a8abcb..5e6d88438fae 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -90,6 +90,71 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
> >         return section->virt_addr + index * PAGE_SIZE;
> >  }
> >  
> > +/*
> > + * This data structure wraps a list of reclaimable EPC pages, and a list of
> > + * non-reclaimable EPC pages and is used to implement a LRU policy during
> > + * reclamation.
> > + */
> > +struct sgx_epc_lru_lists {
> > +       spinlock_t lock;
> > +       struct list_head reclaimable;
> > +       struct list_head unreclaimable;
> > +};
>  
> Why is this named like this, and not sgx_epc_global_lru? Are there
> any other use cases?
> 
> BR, Jarkko

Yes, there are other use cases that are introduced in the other
patches. This structure is used in the cgroup struct to hold
cgroup-specific LRUs.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  2022-12-08 15:46   ` Jarkko Sakkinen
@ 2022-12-08 18:13     ` Kristen Carlson Accardi
  0 siblings, 0 replies; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-08 18:13 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Thu, 2022-12-08 at 15:46 +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 10:36:42AM -0800, Kristen Carlson Accardi wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > When selecting pages to be reclaimed from the page pool (sgx_global_lru),
> > the list of reclaimable pages is walked, and any page that is both
> > reclaimable and not in the process of being freed is added to a list of
> > potential candidates to be reclaimed. After that, this separate list is
> > further examined and may or may not ultimately be reclaimed. In order
> > to prevent this page from being removed from the sgx_epc_lru_lists
> > struct in a separate thread by sgx_drop_epc_page(), keep track of
> > whether the EPC page is in the middle of being reclaimed with
> > the addition of a RECLAIM_IN_PROGRESS flag, and do not delete the page
> > off the LRU in sgx_drop_epc_page() if it has not yet finished being
> > reclaimed.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/main.c | 15 ++++++++++-----
> >  arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
> >  2 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index ecd7f8e704cc..bad72498b0a7 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -305,13 +305,15 @@ static void __sgx_reclaim_pages(void)
> >  
> >                 encl_page = epc_page->encl_owner;
> >  
> > -               if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> > +               if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> > +                       epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> >                         chunk[cnt++] = epc_page;
> > -               else
> > +               } else {
> >                         /* The owner is freeing the page. No need to add the
> >                          * page back to the list of reclaimable pages.
> >                          */
> >                         epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> > +               }
> >         }
> >         spin_unlock(&sgx_global_lru.lock);
> >  
> > @@ -337,6 +339,7 @@ static void __sgx_reclaim_pages(void)
> >  
> >  skip:
> >                 spin_lock(&sgx_global_lru.lock);
> > +               epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> >                 sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
> >                 spin_unlock(&sgx_global_lru.lock);
> >  
> > @@ -360,7 +363,8 @@ static void __sgx_reclaim_pages(void)
> >                 sgx_reclaimer_write(epc_page, &backing[i]);
> >  
> >                 kref_put(&encl_page->encl->refcount, sgx_encl_release);
> > -               epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> > +               epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> > +                                    SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> >  
> >                 sgx_free_epc_page(epc_page);
> >         }
> > @@ -508,7 +512,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
> >  {
> >         spin_lock(&sgx_global_lru.lock);
> > -       WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > +       WARN_ON(page->flags & (SGX_EPC_PAGE_RECLAIMER_TRACKED |
> > +                              SGX_EPC_PAGE_RECLAIM_IN_PROGRESS));
> >         page->flags |= flags;
> >         if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> >                 sgx_epc_push_reclaimable(&sgx_global_lru, page);
> > @@ -532,7 +537,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
> >         spin_lock(&sgx_global_lru.lock);
> >         if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> >                 /* The page is being reclaimed. */
> > -               if (list_empty(&page->list)) {
> > +               if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
> >                         spin_unlock(&sgx_global_lru.lock);
> >                         return -EBUSY;
> >                 }
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > index ba4338b7303f..37d66bc6ca27 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -30,6 +30,8 @@
> >  #define SGX_EPC_PAGE_IS_FREE           BIT(1)
> >  /* Pages allocated for KVM guest */
> >  #define SGX_EPC_PAGE_KVM_GUEST         BIT(2)
> > +/* page flag to indicate reclaim is in progress */
> > +#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
> >  
> >  struct sgx_epc_page {
> >         unsigned int section;
> > -- 
> > 2.38.1
> > 
> 
> I would create:
> 
> enum sgx_epc_state {    
>         SGX_EPC_STATE_READY = 0,
>         SGX_EPC_STATE_RECLAIMER_TRACKED = 1,
>         SGX_EPC_STATE_RECLAIM_IN_PROGRESS = 2,
> };
> 
> I.e. not a bitmask, because a page should have only one state at
> a time for any of this to make any sense. We have an FSM,
> right?
> 

I can experiment with it and see if it can work for the flags to
represent a state machine. They don't work like that right now
obviously, since you can have both RECLAIMER_TRACKED and
RECLAIM_IN_PROGRESS set at the same time.


> And then allocate the 2 upper bits of the flags field to store this
> information.
> 
> And it would probably make sense to have inline helper functions
> for setting and getting the state that do the bit shifting and
> masking shenanigans.
> 
> This would be a patch preceding this one.
> 
> In this patch you should then describe, in the context of the FSM,
> how an EPC page moves between these states. With that knowledge
> we can then reflect on the actual code change.
> 
> The point is not to get it right, but more to have a mindset where we
> can discuss right or wrong in the first place, so just do your
> best and don't stress too much.
> 
> BR, Jarkko


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-08 15:21   ` Jarkko Sakkinen
@ 2022-12-09 16:05     ` Kristen Carlson Accardi
  2022-12-09 16:22       ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Kristen Carlson Accardi @ 2022-12-09 16:05 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On Thu, 2022-12-08 at 15:21 +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 10:36:50AM -0800, Kristen Carlson Accardi
> wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Introduce the OOM path for killing an enclave when the reclaimer is
> > no longer able to reclaim enough EPC pages. Find a victim enclave,
> > which will be an enclave with EPC pages remaining that are not
> > accessible to the reclaimer ("unreclaimable"). Once a victim is
> > identified, mark the enclave as OOM and zap the enclave's entire
> > page range. Release all the enclave's resources except for the
> > struct sgx_encl memory itself.
> > 
> > Signed-off-by: Sean Christopherson
> > <sean.j.christopherson@intel.com>
> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> 
> Why is this patch dependent on all 13 patches before it?
> 
> Looks like something that is orthogonal to cgroups and could
> live on its own. At least it probably does not require all of
> those patches, or does it?
> 
> Even without cgroups it would make sense to kill enclaves if
> the reclaimer gets stuck.
> 
> BR, Jarkko

It is dependent, first of all, on having the LRU struct with the
unreclaimable/reclaimable lists, which means it requires storing the
enclave pointer in the page as well. It's dependent on knowing how many
pages are available, being able to ignore the age of a page, etc. Right
now, without cgroups, sgx will be unable to allocate memory when an
enclave is created if it cannot reclaim enough memory from the existing
in-use enclaves.

Aside from that though, I don't think that killing enclaves makes sense
outside the context of cgroup limits. Without cgroup limits, you have a
max number of EPC pages that you can have active at any one time. If an
enclave attempts to allocate a new page and the reclaimer can't free up
any, how would you decide whether it's ok to kill an entire enclave in
order to grant this other enclave the higher priority for getting a
page? With a cgroup limit, the system owner can explicitly decide what
the limits on usage will be, but without that, you'd have a situation
where one new enclave could kill others, I would think. Better to just
have it the way it is - new page allocations fail if there are no free
pages, but you don't kill enclaves that already exist.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-09 16:05     ` Kristen Carlson Accardi
@ 2022-12-09 16:22       ` Dave Hansen
  2022-12-12 18:09         ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2022-12-09 16:22 UTC (permalink / raw)
  To: Kristen Carlson Accardi, Jarkko Sakkinen
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, zhiquan1.li, Sean Christopherson

On 12/9/22 08:05, Kristen Carlson Accardi wrote:
> Aside from that though, I don't think that killing enclaves makes sense
> outside the context of cgroup limits. 

I think it makes a lot of sense in theory.  Whatever situation we get
into with a cgroup's EPC we can also get into with the whole system's EPC.

*But*, it's orders of magnitude harder to hit on the whole system.
Basically, it has to be at a point where all of the EPC is consumed in
non-SGX-swappable page types like SECS or VEPC pages.  That's
_possible_, of course, but it's really hard to create because one VEPC
page can hold the info of several (32??) swapped-out EPC pages.

So, you'd need roughly 4GB of swapped-out normal enclave memory to
exhaust a system with 128MB of total enclave memory.
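
[Spelling out that arithmetic, using the (32??) ratio guessed above:

  128 MB EPC / 4 KB per page          = 32768 EPC pages
  32768 pages * 32 tracked slots each = 1048576 swapped-out pages
  1048576 pages * 4 KB                = 4 GB of swapped-out enclave memory

i.e. every last EPC page would have to be consumed by fully packed
tracking pages before the system-wide OOM case could trigger.]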

OOM handling is *much* more likely to be necessary in practice if you
have a cgroup with some modestly sized enclaves and a very tiny EPC
limit.  If someone wants to
extend this OOM support to system-wide EPC later, then go ahead.  But, I
don't think it makes a lot of sense to invert this series for it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-09 16:22       ` Dave Hansen
@ 2022-12-12 18:09         ` Sean Christopherson
  2022-12-26 20:43           ` Jarkko Sakkinen
  0 siblings, 1 reply; 65+ messages in thread
From: Sean Christopherson @ 2022-12-12 18:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kristen Carlson Accardi, Jarkko Sakkinen, dave.hansen, tj,
	linux-kernel, linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On Fri, Dec 09, 2022, Dave Hansen wrote:
> On 12/9/22 08:05, Kristen Carlson Accardi wrote:
> > Aside from that though, I don't think that killing enclaves makes sense
> > outside the context of cgroup limits. 
> 
> I think it makes a lot of sense in theory.  Whatever situation we get
> into with a cgroup's EPC we can also get into with the whole system's EPC.
> 
> *But*, it's orders of magnitude harder to hit on the whole system.

...

> If someone wants to extend this OOM support to system-wide EPC later, then go
> ahead.  But, I don't think it makes a lot of sense to invert this series for
> it.

+1 from the peanut gallery.  With VMM EPC oversubscription support, no sane VMM
will oversubscribe VEPC pages.  And for VA pages, supporting swap of VA pages is
likely a more userspace-friendly approach if system-wide EPC OOM is a concern.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2022-12-12 18:09         ` Sean Christopherson
@ 2022-12-26 20:43           ` Jarkko Sakkinen
  0 siblings, 0 replies; 65+ messages in thread
From: Jarkko Sakkinen @ 2022-12-26 20:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kristen Carlson Accardi, dave.hansen, tj,
	linux-kernel, linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, zhiquan1.li

On Mon, Dec 12, 2022 at 06:09:39PM +0000, Sean Christopherson wrote:
> On Fri, Dec 09, 2022, Dave Hansen wrote:
> > On 12/9/22 08:05, Kristen Carlson Accardi wrote:
> > > Aside from that though, I don't think that killing enclaves makes sense
> > > outside the context of cgroup limits. 
> > 
> > I think it makes a lot of sense in theory.  Whatever situation we get
> > into with a cgroup's EPC we can also get into with the whole system's EPC.
> > 
> > *But*, it's orders of magnitude harder to hit on the whole system.
> 
> ...
> 
> > If someone wants to extend this OOM support to system-wide EPC later, then go
> > ahead.  But, I don't think it makes a lot of sense to invert this series for
> > it.
> 
> +1 from the peanut gallery.  With VMM EPC oversubscription support, no sane VMM
> will oversubscribe VEPC pages.  And for VA pages, supporting swap of VA pages is
> likely a more userspace-friendly approach if system-wide EPC OOM is a concern.

When swapping VA pages, the topology of the VA page cache for swapped VA
pages is the main question. It is a compromise between how long swap-in
and swap-out can take and how generic the solution should be, meaning how
deep a hierarchy you want to build, or whether a flat list of parent VA
pages is "good enough".

Also, there's the question of whether it should be a global cache,
per-cgroup, and so forth.

Implementing any solution is not overly complicated. Locking in one of
these options is what puzzles me.

BR, Jarkko
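
[To make the "flat list of parent VA pages" option concrete, a rough sketch
under the assumption that each swapped-out VA page is versioned by a slot in
a parent VA page kept on a single global list. Every name here is
hypothetical, and the locking question raised above is deliberately left
open:]

/*
 * Hypothetical flat (one-level) cache of parent VA pages. Each parent
 * provides version slots (SGX_VA_SLOT_COUNT of them) for child VA pages
 * that have themselves been swapped out of the EPC.
 */
struct sgx_va_parent {
	struct sgx_va_page *va_page;	/* EPC-resident parent VA page */
	unsigned long used_slots;	/* bitmap of occupied version slots */
	struct list_head list;		/* link in the global flat list */
};

static LIST_HEAD(sgx_va_parents);	/* global, rather than per-cgroup */

[A deeper hierarchy would chain parents of parents; a per-cgroup variant
would move the list head into the cgroup's EPC state. The trade-offs are
exactly the ones described above.]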

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [EXTERNAL] [PATCH v2 00/18]  Add Cgroup support for SGX EPC memory
  2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
                   ` (17 preceding siblings ...)
  2022-12-02 18:36 ` [PATCH v2 18/18] Docs/x86/sgx: Add description for cgroup support Kristen Carlson Accardi
@ 2023-04-03 21:26 ` Anand Krishnamoorthi
  2023-04-13 18:49   ` Anand Krishnamoorthi
  18 siblings, 1 reply; 65+ messages in thread
From: Anand Krishnamoorthi @ 2023-04-03 21:26 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Bo Zhang (ACC)
  Cc: zhiquan1.li

Adding Bo Zhang to thread.

-Anand



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [EXTERNAL] [PATCH v2 00/18]  Add Cgroup support for SGX EPC memory
  2023-04-03 21:26 ` [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Anand Krishnamoorthi
@ 2023-04-13 18:49   ` Anand Krishnamoorthi
  2023-04-18 16:44     ` Mikko Ylinen
  0 siblings, 1 reply; 65+ messages in thread
From: Anand Krishnamoorthi @ 2023-04-13 18:49 UTC (permalink / raw)
  To: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Bo Zhang (ACC)
  Cc: zhiquan1.li

For Azure, the SGX cgroup support feature is very useful.
It is needed to enforce the EPC resource limits of Kubernetes pods on SGX nodes.

Today, in Azure Kubernetes Service, each pod on an SGX node claims a nominal EPC memory requirement. K8s tracks the unclaimed EPC memory on SGX nodes to schedule pods.
However, there is no enforcement on the node of whether a pod uses more EPC memory than it claims. If EPC runs out on the node, the kernel will do EPC paging, which causes all pods to suffer performance degradation.

Cgroup support for EPC will enforce EPC resource limits at the pod level, so that when a pod tries to use more EPC than it claims, it will be EPC paged while other pods are not affected.

-Anand

From: Anand Krishnamoorthi <anakrish@microsoft.com>
Sent: Monday, April 3, 2023 2:26 PM
To: Kristen Carlson Accardi <kristen@linux.intel.com>; jarkko@kernel.org <jarkko@kernel.org>; dave.hansen@linux.intel.com <dave.hansen@linux.intel.com>; tj@kernel.org <tj@kernel.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-sgx@vger.kernel.org <linux-sgx@vger.kernel.org>; cgroups@vger.kernel.org <cgroups@vger.kernel.org>; Bo Zhang (ACC) <zhanb@microsoft.com>
Cc: zhiquan1.li@intel.com <zhiquan1.li@intel.com>
Subject: Re: [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory

Adding Bo Zhang to thread.

-Anand


From: Kristen Carlson Accardi <kristen@linux.intel.com>
Sent: Friday, December 2, 2022 10:36 AM
To: jarkko@kernel.org <jarkko@kernel.org>; dave.hansen@linux.intel.com <dave.hansen@linux.intel.com>; tj@kernel.org <tj@kernel.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-sgx@vger.kernel.org <linux-sgx@vger.kernel.org>; cgroups@vger.kernel.org <cgroups@vger.kernel.org>
Cc: zhiquan1.li@intel.com <zhiquan1.li@intel.com>
Subject: [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory

Utilize the Miscellaneous cgroup controller to regulate the distribution
of SGX EPC memory, which is a subset of system RAM that is used to provide
SGX-enabled applications with protected memory, and is otherwise inaccessible.

SGX EPC memory allocations are separate from normal RAM allocations,
and is managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory.

This patchset implements the support for sgx_epc memory within the
misc cgroup controller, and then utilizes the misc cgroup controller
to provide support for setting the total system capacity, max limit
per cgroup, and events.

This work was originally authored by Sean Christopherson a few years ago,
and was modified to work with more recent kernels, and to utilize the
misc cgroup controller rather than a custom controller. It is currently
based on top of the MCA patches.

Here's the MCA patchset for reference.
https://lore.kernel.org/linux-sgx/2d52c8c4-8ed0-6df2-2911-da5b9fcc9ae4@intel.com/T/#t

The patchset adds support for multiple LRUs to track both reclaimable
EPC pages (i.e. pages the reclaimer knows about), as well as unreclaimable
EPC pages (i.e. pages which the reclaimer isn't aware of, such as va pages).
These pages are assigned to an LRU, as well as an enclave, so that an
enclave's full EPC usage can be tracked, and limited to a max value. During
OOM events, an enclave can be have its memory zapped, and all the EPC pages
not tracked by the reclaimer can be freed.

I appreciate your comments and feedback.

Changelog:

v2:
 * rename struct sgx_epc_lru to sgx_epc_lru_lists to be more clear
   that this struct contains 2 lists.
 * use inline functions rather than macros for sgx_epc_page_list*
   wrappers.
 * Remove flags macros and open code all flags.
 * Improve the commit message for RECLAIM_IN_PROGRESS patch to make
   it more clear what the patch does.
 * remove notifier_block from misc cgroup changes and use a set
   of ops for callbacks instead.
 * rename root_misc to misc_cg_root and parent_misc to misc_cg_parent
 * consolidate misc cgroup changes to 2 patches and remove most of
   the previous helper functions.

Kristen Carlson Accardi (7):
  x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  x86/sgx: Use sgx_epc_lru_lists for existing active page list
  x86/sgx: Track epc pages on reclaimable or unreclaimable lists
  cgroup/misc: Add per resource callbacks for css events
  cgroup/misc: Prepare for SGX usage
  x86/sgx: Add support for misc cgroup controller
  Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (11):
  x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  x86/sgx: Store struct sgx_encl when allocating new VA pages
  x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  x86/sgx: Return the number of EPC pages that were successfully
    reclaimed
  x86/sgx: Add option to ignore age of page during EPC reclaim
  x86/sgx: Prepare for multiple LRUs
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC

 Documentation/x86/sgx.rst            |  77 ++++
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/encl.c       |  90 ++++-
 arch/x86/kernel/cpu/sgx/encl.h       |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 539 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 +++
 arch/x86/kernel/cpu/sgx/ioctl.c      |  14 +-
 arch/x86/kernel/cpu/sgx/main.c       | 412 ++++++++++++++++----
 arch/x86/kernel/cpu/sgx/sgx.h        | 122 +++++-
 arch/x86/kernel/cpu/sgx/virt.c       |  28 +-
 include/linux/misc_cgroup.h          |  35 ++
 kernel/cgroup/misc.c                 |  76 +++-
 13 files changed, 1341 insertions(+), 129 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

--
2.38.1

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [EXTERNAL] [PATCH v2 00/18]  Add Cgroup support for SGX EPC memory
  2023-04-13 18:49   ` Anand Krishnamoorthi
@ 2023-04-18 16:44     ` Mikko Ylinen
  2023-04-27 16:53       ` Anand Krishnamoorthi
  0 siblings, 1 reply; 65+ messages in thread
From: Mikko Ylinen @ 2023-04-18 16:44 UTC (permalink / raw)
  To: Anand Krishnamoorthi
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Bo Zhang (ACC),
	zhiquan1.li

Hi,

On Thu, Apr 13, 2023 at 06:49:53PM +0000, Anand Krishnamoorthi wrote:
> For Azure, the SGX cgroup support feature is very useful.
> It is needed to enforce the EPC resource limits of Kubernetes pods on SGX nodes.

I've been working on enabling the same use case, with the difference that
I'm setting per-container EPC limits (instead of per-pod). The Open Container
Initiative (OCI) runtime spec [1] defines how it's done, and with the misc
controller implemented here, a "misc.max": "sgx_epc 42" setting for a container
is supported by runc out of the box.

In addition to being able to set limits per container/pod, the cgroup for
SGX EPC helps to build better telemetry/monitoring for EPC consumption.

[1] https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#unified

> 
> Today, in Azure Kubernetes Service, each pod on an SGX node claims a nominal EPC memory requirement. K8s tracks the unclaimed EPC memory on SGX nodes to schedule pods.
> However, there is no enforcement on the node of whether a pod uses more EPC memory than it claims. If EPC runs out on the node, the kernel will do EPC paging, which causes all pods to suffer performance degradation.
> 
> Cgroup support for EPC will enforce EPC resource limits at the pod level, so that when a pod tries to use more EPC than it claims, it will be EPC paged while other pods are not affected.
> 
> -Anand
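
[For concreteness, the cgroup v2 plumbing Mikko describes, using the
sgx_epc resource name from this series; the byte values here are invented:]

  # cat /sys/fs/cgroup/misc.capacity
  sgx_epc 268435456
  # echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
  # mkdir /sys/fs/cgroup/pod1
  # echo "sgx_epc 134217728" > /sys/fs/cgroup/pod1/misc.max

[and the equivalent per-container setting expressed through the OCI runtime
spec's "unified" map, which runc writes into the corresponding cgroup files:]

  "linux": {
    "resources": {
      "unified": { "misc.max": "sgx_epc 134217728" }
    }
  }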

-- 
Regards, Mikko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [EXTERNAL] [PATCH v2 00/18]  Add Cgroup support for SGX EPC memory
  2023-04-18 16:44     ` Mikko Ylinen
@ 2023-04-27 16:53       ` Anand Krishnamoorthi
  0 siblings, 0 replies; 65+ messages in thread
From: Anand Krishnamoorthi @ 2023-04-27 16:53 UTC (permalink / raw)
  To: Mikko Ylinen, Liz Zhang
  Cc: Kristen Carlson Accardi, jarkko, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Bo Zhang (ACC),
	zhiquan1.li

Adding Liz Zhang.





^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2023-04-27 16:54 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-02 18:36 [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Kristen Carlson Accardi
2022-12-02 18:36 ` [PATCH v2 01/18] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages() Kristen Carlson Accardi
2022-12-02 21:33   ` Dave Hansen
2022-12-02 21:37     ` Kristen Carlson Accardi
2022-12-02 21:45       ` Dave Hansen
2022-12-02 22:17         ` Kristen Carlson Accardi
2022-12-02 22:37           ` Dave Hansen
2022-12-02 18:36 ` [PATCH v2 02/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Kristen Carlson Accardi
2022-12-02 21:35   ` Dave Hansen
2022-12-02 21:40     ` Kristen Carlson Accardi
2022-12-02 21:48       ` Dave Hansen
2022-12-02 22:35         ` Sean Christopherson
2022-12-02 22:47           ` Dave Hansen
2022-12-02 22:49             ` Sean Christopherson
2022-12-02 18:36 ` [PATCH v2 03/18] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Kristen Carlson Accardi
2022-12-02 21:39   ` Dave Hansen
2022-12-08 15:31   ` Jarkko Sakkinen
2022-12-08 18:03     ` Kristen Carlson Accardi
2022-12-02 18:36 ` [PATCH v2 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Kristen Carlson Accardi
2022-12-02 21:43   ` Dave Hansen
2022-12-02 21:51     ` Kristen Carlson Accardi
2022-12-02 22:10       ` Dave Hansen
2022-12-02 18:36 ` [PATCH v2 05/18] x86/sgx: Track epc pages on reclaimable or unreclaimable lists Kristen Carlson Accardi
2022-12-02 22:13   ` Dave Hansen
2022-12-02 22:28     ` Sean Christopherson
2022-12-02 18:36 ` [PATCH v2 06/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages Kristen Carlson Accardi
2022-12-02 22:15   ` Dave Hansen
2022-12-08 15:46   ` Jarkko Sakkinen
2022-12-08 18:13     ` Kristen Carlson Accardi
2022-12-02 18:36 ` [PATCH v2 07/18] x86/sgx: Use a list to track to-be-reclaimed pages during reclaim Kristen Carlson Accardi
2022-12-02 22:33   ` Dave Hansen
2022-12-05 16:33     ` Kristen Carlson Accardi
2022-12-05 17:03       ` Dave Hansen
2022-12-05 18:25         ` Kristen Carlson Accardi
2022-12-02 18:36 ` [PATCH v2 08/18] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Kristen Carlson Accardi
2022-12-08  9:26   ` Jarkko Sakkinen
2022-12-08  9:27     ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 09/18] x86/sgx: Return the number of EPC pages that were successfully reclaimed Kristen Carlson Accardi
2022-12-08  9:30   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 10/18] x86/sgx: Add option to ignore age of page during EPC reclaim Kristen Carlson Accardi
2022-12-08  9:37   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 11/18] x86/sgx: Prepare for multiple LRUs Kristen Carlson Accardi
2022-12-08  9:42   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 12/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Kristen Carlson Accardi
2022-12-08  9:46   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 13/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Kristen Carlson Accardi
2022-12-08  9:56   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 14/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Kristen Carlson Accardi
2022-12-08 15:21   ` Jarkko Sakkinen
2022-12-09 16:05     ` Kristen Carlson Accardi
2022-12-09 16:22       ` Dave Hansen
2022-12-12 18:09         ` Sean Christopherson
2022-12-26 20:43           ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 15/18] cgroup/misc: Add per resource callbacks for css events Kristen Carlson Accardi
2022-12-08 14:53   ` Jarkko Sakkinen
2022-12-08 15:15     ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 16/18] cgroup/misc: Prepare for SGX usage Kristen Carlson Accardi
2022-12-08 15:23   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 17/18] x86/sgx: Add support for misc cgroup controller Kristen Carlson Accardi
2022-12-08 15:30   ` Jarkko Sakkinen
2022-12-02 18:36 ` [PATCH v2 18/18] Docs/x86/sgx: Add description for cgroup support Kristen Carlson Accardi
2023-04-03 21:26 ` [EXTERNAL] [PATCH v2 00/18] Add Cgroup support for SGX EPC memory Anand Krishnamoorthi
2023-04-13 18:49   ` Anand Krishnamoorthi
2023-04-18 16:44     ` Mikko Ylinen
2023-04-27 16:53       ` Anand Krishnamoorthi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).