[PATCH v2 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows


* [PATCH v2 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
@ 2024-05-15 13:12 Dmitrii Kuvaiskii
  2024-05-15 13:12 ` [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
  2024-05-15 13:12 ` [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
  0 siblings, 2 replies; 13+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-05-15 13:12 UTC (permalink / raw)
  To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
	linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin

SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

EDMM-based lazy allocation and MADV_DONTNEED semantics provide
significant performance improvement for some workloads that run on
Gramine. For example, a Java workload with a 16GB enclave size has
approx. 57x improvement in total runtime. Thus, we consider it important
to permit these optimizations in Gramine. However, we observed hangs of
applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.

We wrote a trivial stress test to reproduce the hangs observed in
real-world applications. The test stresses #PF-based page allocation and
SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:

/* repeatedly touch different enclave pages at random and mix with
 * madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
    char* base = arg;
    size_t num_pages = 0xA000 / page_size;
    for (int i = 0; i < 5000; i++) {
        /* touch a random page; under Gramine with EDMM a non-present
         * page is allocated lazily on #PF */
        size_t page = get_random_ulong() % num_pages;
        char data = READ_ONCE(base[page * page_size]);
        (void)data;

        /* drop another random page */
        page = get_random_ulong() % num_pages;
        madvise(base + page * page_size, page_size, MADV_DONTNEED);
    }
    return NULL;
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
    pthread_create(&threads[i], NULL, thread_func, addr);

This test uncovers two data races in the SGX driver. The two patches in
this series describe and fix these races.
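
For readers who want to see the access pattern without Gramine, here is
a self-contained user-space variant of the fragment above. It is only a
sketch under stated assumptions: run_stress(), struct stress_ctx and the
xorshift generator are hypothetical stand-ins for the helpers the
fragment leaves undefined (page_size, get_random_ulong(), READ_ONCE()).
Outside an enclave it exercises ordinary anonymous memory, so it cannot
reproduce the SGX hangs; it only shows the mix of random reads and
madvise(MADV_DONTNEED) calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 0xA000UL  /* same region size as the fragment above */

/* Hypothetical per-thread context (not part of the original test). */
struct stress_ctx {
    char *base;
    size_t page_size;
    uint64_t seed;
    int iterations;
};

/* xorshift64: small PRNG standing in for get_random_ulong(). */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

static void *stress_thread(void *arg)
{
    struct stress_ctx *ctx = arg;
    size_t num_pages = REGION_SIZE / ctx->page_size;
    uint64_t rng = ctx->seed;

    assert(num_pages > 0);
    for (int i = 0; i < ctx->iterations; i++) {
        /* Volatile read stands in for READ_ONCE(). */
        size_t page = next_random(&rng) % num_pages;
        (void)*(volatile char *)(ctx->base + page * ctx->page_size);

        /* Drop another random page. */
        page = next_random(&rng) % num_pages;
        madvise(ctx->base + page * ctx->page_size, ctx->page_size,
                MADV_DONTNEED);
    }
    return NULL;
}

/* Returns 0 if all threads ran to completion, -1 if setup failed. */
static int run_stress(int num_threads, int iterations)
{
    enum { MAX_THREADS = 64 };
    pthread_t threads[MAX_THREADS];
    struct stress_ctx ctx[MAX_THREADS];
    long ps = sysconf(_SC_PAGESIZE);
    int started = 0, ret = 0;

    if (num_threads < 1 || num_threads > MAX_THREADS ||
        ps <= 0 || (size_t)ps > REGION_SIZE)
        return -1;

    char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    for (int i = 0; i < num_threads; i++) {
        ctx[i] = (struct stress_ctx){ base, (size_t)ps,
                                      0x9E3779B97F4A7C15ULL + (uint64_t)i,
                                      iterations };
        if (pthread_create(&threads[i], NULL, stress_thread, &ctx[i]) != 0) {
            ret = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    munmap(base, REGION_SIZE);
    return ret;
}
```

Calling run_stress(16, 5000) mirrors the thread and iteration counts of
the fragment above.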

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRM, without madvise(). This stresses
  the first data race. A Gramine SGX test suite ran in the background
  for additional stress. Result: 1,000 runs without hangs (without the
  first bug fix: hangs every time).
- On Icelake server with 128GB of PRM, with madvise(). This stresses the
  second data race. A Gramine SGX test suite ran in the background for
  additional stress. Result: 1,000 runs without hangs (with the first
  bug fix but without the second bug fix: hangs approx. once in 50
  runs).
- On Icelake server with 4GB of PRM, with madvise(). This additionally
  stresses the enclave page swapping flows. Two Gramine SGX test suites
  ran in the background to put additional stress on swapping (I
  observed 100% CPU utilization from ksgxd, which confirms that
  swapping happened). Result: 1,000 runs without hangs.

[1] https://github.com/gramineproject/gramine/pull/1513

v1 -> v2:
- No changes in code itself
- Expanded cover letter
- Added CPU1 vs CPU2 race scenarios in commit messages

v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvaiskii@intel.com/

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++++++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.34.1



* [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:12 [PATCH v2 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
@ 2024-05-15 13:12 ` Dmitrii Kuvaiskii
  2024-05-15 13:54   ` Jarkko Sakkinen
                     ` (2 more replies)
  2024-05-15 13:12 ` [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
  1 sibling, 3 replies; 13+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-05-15 13:12 UTC (permalink / raw)
  To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
	linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

Two enclave threads may try to access the same non-present enclave page
simultaneously (e.g., if the SGX runtime supports lazy allocation). The
threads will end up in sgx_encl_eaug_page(), racing to acquire the
enclave lock. The winning thread will perform EAUG, set up the page
table entry, and insert the page into encl->page_array. The losing
thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
to the error handling path.

This race condition can be illustrated as follows:

/*                             /*
 * Fault on CPU1                * Fault on CPU2
 * on enclave page X            * on enclave page X
 */                             */
sgx_vma_fault() {              sgx_vma_fault() {

  xa_load(&encl->page_array)     xa_load(&encl->page_array)
      == NULL -->                    == NULL -->

  sgx_encl_eaug_page() {         sgx_encl_eaug_page() {

    ...                            ...

    /*                             /*
     * alloc encl_page              * alloc encl_page
     */                             */
                                   mutex_lock(&encl->lock);
                                   /*
                                    * alloc EPC page
                                    */
                                   epc_page = sgx_alloc_epc_page(...);
                                   /*
                                    * add page to enclave's xarray
                                    */
                                   xa_insert(&encl->page_array, ...);
                                   /*
                                    * add page to enclave via EAUG
                                    * (page is in pending state)
                                    */
                                   /*
                                    * add PTE entry
                                    */
                                   vmf_insert_pfn(...);

                                   mutex_unlock(&encl->lock);
                                   return VM_FAULT_NOPAGE;
                                 }
                               }
                               /*
                                * All good up to here: enclave page
                                * successfully added to enclave,
                                * ready for EACCEPT from user space
                                */
    mutex_lock(&encl->lock);
    /*
     * alloc EPC page
     */
    epc_page = sgx_alloc_epc_page(...);
    /*
     * add page to enclave's xarray,
     * this fails with -EBUSY as this
     * page was already added by CPU2
     */
    xa_insert(&encl->page_array, ...);

  err_out_shrink:
    sgx_encl_free_epc_page(epc_page) {
      /*
       * remove page via EREMOVE
       *
       * *BUG*: page added by CPU2 is
       * yanked from enclave while it
       * remains accessible from OS
       * perspective (PTE installed)
       */
      /*
       * free EPC page
       */
      sgx_free_epc_page(epc_page);
    }

    mutex_unlock(&encl->lock);
    /*
     * *BUG*: SIGBUS is returned
     * for a valid enclave page
     */
    return VM_FAULT_SIGBUS;
  }
}

The err_out_shrink error handling path contains two bugs: (1)
sgx_encl_free_epc_page() is called and performs EREMOVE even though the
enclave page was never intended to be removed, and (2) SIGBUS is sent to
userspace even though the enclave page is correctly installed by another
thread.

The first bug renders the enclave page perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, user space is stuck with the inaccessible page.
The second bug is less severe: a spurious SIGBUS signal is sent to user
space.

Fix these two bugs (1) by replacing sgx_encl_free_epc_page() with
sgx_free_epc_page() so that no EREMOVE is performed, and (2) by
returning VM_FAULT_NOPAGE to the generic Linux fault handler so that no
signal is sent to userspace.

Note that sgx_encl_free_epc_page() performs an additional WARN_ON_ONCE
check in comparison to sgx_free_epc_page(): whether the EPC page is
being tracked by the reclaimer. However, the EPC page is allocated in
sgx_encl_eaug_page() and has zeroed-out flags in all error handling
paths. In other words, the page is marked as reclaimable only in the
happy path of sgx_encl_eaug_page(). Therefore, in the particular code
path affected in this commit, the "page reclaimer tracked" condition is
always false and the warning is never printed. Thus, it is safe to
replace sgx_encl_free_epc_page() with sgx_free_epc_page().
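
The loser's error path can be checked without SGX hardware using a
small user-space model. This is a sketch, not driver code: every model_*
name, struct model_stats, and the numeric values of the constants are
assumptions made for illustration. It models only what the diff below
changes: the fault return code chosen when xa_insert() reports -EBUSY,
and whether the helper used to free the EPC page issues EREMOVE.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the values are NOT the kernel's. */
enum model_vm_fault { MODEL_VM_FAULT_NOPAGE, MODEL_VM_FAULT_SIGBUS };
#define MODEL_EBUSY 16
#define MODEL_RECLAIMER_TRACKED 0x1u  /* "page reclaimer tracked" flag */

struct model_epc_page {
    unsigned int flags;
    bool allocated;
};

struct model_stats {
    int eremove_calls;  /* how often EREMOVE would have been issued */
    int warnings;       /* how often WARN_ON_ONCE would have fired */
    int frees;          /* how often a page went back to the allocator */
};

/* Models sgx_free_epc_page(): only returns the page to the allocator. */
static void model_free_epc_page(struct model_epc_page *page,
                                struct model_stats *st)
{
    page->allocated = false;
    st->frees++;
}

/* Models sgx_encl_free_epc_page(): warn if tracked, EREMOVE, then free. */
static void model_encl_free_epc_page(struct model_epc_page *page,
                                     struct model_stats *st)
{
    if (page->flags & MODEL_RECLAIMER_TRACKED)
        st->warnings++;
    st->eremove_calls++;
    model_free_epc_page(page, st);
}

/*
 * Models the error tail of sgx_encl_eaug_page() for a thread whose
 * xa_insert() returned insert_ret. 'patched' selects the behaviour
 * before (false) or after (true) this patch.
 */
static enum model_vm_fault model_eaug_error_path(int insert_ret, bool patched,
                                                 struct model_stats *st)
{
    /* The page was just allocated, so its flags are still zero. */
    struct model_epc_page page = { .flags = 0, .allocated = true };
    enum model_vm_fault vmret = MODEL_VM_FAULT_SIGBUS;

    assert(insert_ret != 0);  /* only the error path is modelled */

    if (patched && insert_ret == -MODEL_EBUSY)
        vmret = MODEL_VM_FAULT_NOPAGE;

    if (patched)
        model_free_epc_page(&page, st);
    else
        model_encl_free_epc_page(&page, st);

    return vmret;
}
```

In this model the unpatched path returns SIGBUS and counts one EREMOVE,
while the patched path returns NOPAGE for -EBUSY and counts none. The
warning counter stays at zero in both, which matches the argument above
that the page's flags are zero on every error path.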

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: stable@vger.kernel.org
Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	 * If ret == -EBUSY then page was created in another flow while
 	 * running without encl->lock
 	 */
-	if (ret)
+	if (ret) {
+		if (ret == -EBUSY)
+			vmret = VM_FAULT_NOPAGE;
 		goto err_out_shrink;
+	}

 	pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
 	pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 err_out_shrink:
 	sgx_encl_shrink(encl, va_page);
 err_out_epc:
-	sgx_encl_free_epc_page(epc_page);
+	sgx_free_epc_page(epc_page);
 err_out_unlock:
 	mutex_unlock(&encl->lock);
 	kfree(encl_page);
-- 
2.34.1


* [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
  2024-05-15 13:12 [PATCH v2 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
  2024-05-15 13:12 ` [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
@ 2024-05-15 13:12 ` Dmitrii Kuvaiskii
  2024-05-15 14:44   ` Jarkko Sakkinen
                     ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-05-15 13:12 UTC (permalink / raw)
  To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
	linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable

Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). The scenario proceeds as
follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on enclave page X
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl);
                                 /*
                                  * Fault on CPU2 on same page X
                                  */
                                 sgx_vma_fault() {
                                   /*
                                    * PTE entry was removed, but the
                                    * page is still in enclave's xarray
                                    */
                                   xa_load(&encl->page_array) != NULL ->
                                   /*
                                    * SGX driver thinks that this page
                                    * was swapped out and loads it
                                    */
                                   mutex_lock(&encl->lock);
                                   /*
                                    * this is effectively a no-op
                                    */
                                   entry = sgx_encl_load_page_in_vma();
                                   /*
                                    * add PTE entry
                                    *
                                    * *BUG*: a PTE is installed for a
                                    * page in process of being removed
                                    */
                                   vmf_insert_pfn(...);

                                   mutex_unlock(&encl->lock);
                                   return VM_FAULT_NOPAGE;
                                 }
  /*
   * continue with page removal
   */
  mutex_lock(&encl->lock);

  sgx_encl_free_epc_page(epc_page) {
    /*
     * remove page via EREMOVE
     */
    /*
     * free EPC page
     */
    sgx_free_epc_page(epc_page);
  }

  xa_erase(&encl->page_array);

  mutex_unlock(&encl->lock);
}

Here, CPU1 removed the page. However, CPU2 installed the PTE entry on
the same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, user space is stuck with the inaccessible page.

Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by
introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
default and set only right before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is being removed, and if so, CPU2 backs off and waits until the
page is completely removed. After that, any memory access to this page
results in a normal "allocate and EAUG a page on #PF" flow.
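
The interleaving above can be replayed deterministically in user space.
The following is a minimal sketch under stated assumptions: all model_*
names and struct model_encl_page are hypothetical, the value of
MODEL_EBUSY is illustrative, and the booleans stand in for the PTE, the
EPC page and the xarray entry. Only the two bit positions are taken from
the diff below. It replays "CPU1 begins removal, CPU2 faults, CPU1
finishes removal" and reports whether a PTE is left behind for a removed
page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BIT(n) (1UL << (n))
#define MODEL_PAGE_BEING_RECLAIMED MODEL_BIT(3)  /* BIT(3) in encl.h */
#define MODEL_PAGE_BEING_REMOVED   MODEL_BIT(2)  /* BIT(2), added here */
#define MODEL_EBUSY 16                           /* illustrative value */

struct model_encl_page {
    unsigned long desc;   /* page address plus low flag bits */
    bool has_epc_page;    /* stand-in for entry->epc_page != NULL */
    bool pte_present;     /* stand-in for the user-space PTE */
    bool in_page_array;   /* stand-in for the encl->page_array entry */
};

/* Models the check in __sgx_encl_load_page(): 0 or -EBUSY (back off). */
static int model_load_page(const struct model_encl_page *entry, bool patched)
{
    unsigned long busy = MODEL_PAGE_BEING_RECLAIMED;

    assert(entry != NULL);
    if (patched)
        busy |= MODEL_PAGE_BEING_REMOVED;
    if (entry->has_epc_page && (entry->desc & busy))
        return -MODEL_EBUSY;
    return 0;
}

/* CPU1, first half: mark the page (patched only), then zap the PTE. */
static void model_remove_begin(struct model_encl_page *entry, bool patched)
{
    if (patched)
        entry->desc |= MODEL_PAGE_BEING_REMOVED;
    entry->pte_present = false;
}

/* CPU2: fault on the same page; installs a PTE unless told to back off. */
static int model_fault(struct model_encl_page *entry, bool patched)
{
    int ret = model_load_page(entry, patched);

    if (ret)
        return ret;
    entry->pte_present = true;
    return 0;
}

/* CPU1, second half: EREMOVE and free the EPC page, erase the entry. */
static void model_remove_finish(struct model_encl_page *entry)
{
    entry->has_epc_page = false;
    entry->in_page_array = false;
}

/* Returns true if a PTE is left behind for a page that was removed. */
static bool model_stale_pte_after_race(bool patched)
{
    struct model_encl_page page = {
        .desc = 0x7000UL,
        .has_epc_page = true,
        .pte_present = true,
        .in_page_array = true,
    };

    model_remove_begin(&page, patched);
    (void)model_fault(&page, patched);
    model_remove_finish(&page);

    return page.pte_present && !page.has_epc_page;
}
```

With patched == false the replay ends with a stale PTE; with
patched == true the fault backs off with -EBUSY and no PTE is left
behind.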

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
 	/* Entry successfully located. */
 	if (entry->epc_page) {
-		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+				   SGX_ENCL_PAGE_BEING_REMOVED))
 			return ERR_PTR(-EBUSY);
 
 		return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -25,6 +25,9 @@
 /* 'desc' bit marking that the page is being reclaimed. */
 #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
 
+/* 'desc' bit marking that the page is being removed. */
+#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
+
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_max_prot_bits:8;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..c542d4dd3e64 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 		 * Do not keep encl->lock because of dependency on
 		 * mmap_lock acquired in sgx_zap_enclave_ptes().
 		 */
+		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
 		mutex_unlock(&encl->lock);
 
 		sgx_zap_enclave_ptes(encl, addr);
-- 
2.34.1



* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:12 ` [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
@ 2024-05-15 13:54   ` Jarkko Sakkinen
  2024-05-15 13:56     ` Jarkko Sakkinen
  2024-05-15 14:15     ` Dave Hansen
  2024-05-15 15:58   ` Reinette Chatre
  2024-05-15 22:01   ` Haitao Huang
  2 siblings, 2 replies; 13+ messages in thread
From: Jarkko Sakkinen @ 2024-05-15 13:54 UTC (permalink / raw)
  To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
	reinette.chatre, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote:
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	 * If ret == -EBUSY then page was created in another flow while
>  	 * running without encl->lock
>  	 */
> -	if (ret)
> +	if (ret) {
> +		if (ret == -EBUSY)
> +			vmret = VM_FAULT_NOPAGE;
>  		goto err_out_shrink;
> +	}

I agree that there is a bug, but it does not qualify as a race
condition.

The bug is simply that for a valid page SIGBUS might be returned.
The fix is correct but the claim is not.

>  
>  	pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>  	pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  err_out_shrink:
>  	sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> -	sgx_encl_free_epc_page(epc_page);
> +	sgx_free_epc_page(epc_page);
>  err_out_unlock:
>  	mutex_unlock(&encl->lock);
>  	kfree(encl_page);

Agree with code change 100% but not with the description.

I'd cut 90% of the description out and just make the argument of
the wrong error code, and done. The sequence is great for showing
how this could happen. The prose makes my head hurt tbh.

BR, Jarkko


* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:54   ` Jarkko Sakkinen
@ 2024-05-15 13:56     ` Jarkko Sakkinen
  2024-05-15 14:15     ` Dave Hansen
  1 sibling, 0 replies; 13+ messages in thread
From: Jarkko Sakkinen @ 2024-05-15 13:56 UTC (permalink / raw)
  To: Jarkko Sakkinen, Dmitrii Kuvaiskii, dave.hansen, kai.huang,
	haitao.huang, reinette.chatre, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

On Wed May 15, 2024 at 4:54 PM EEST, Jarkko Sakkinen wrote:
> On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote:
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 279148e72459..41f14b1a3025 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> >  	 * If ret == -EBUSY then page was created in another flow while
> >  	 * running without encl->lock
> >  	 */
> > -	if (ret)
> > +	if (ret) {
> > +		if (ret == -EBUSY)
> > +			vmret = VM_FAULT_NOPAGE;
> >  		goto err_out_shrink;
> > +	}
>
> I agree that there is a bug, but it does not qualify as a race
> condition.
>
> The bug is simply that for a valid page SIGBUS might be returned.
> The fix is correct but the claim is not.
>
> >  
> >  	pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
> >  	pginfo.addr = encl_page->desc & PAGE_MASK;
> > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> >  err_out_shrink:
> >  	sgx_encl_shrink(encl, va_page);
> >  err_out_epc:
> > -	sgx_encl_free_epc_page(epc_page);
> > +	sgx_free_epc_page(epc_page);
> >  err_out_unlock:
> >  	mutex_unlock(&encl->lock);
> >  	kfree(encl_page);
>
> Agree with code change 100% but not with the description.
>
> I'd cut 90% of the description out and just make the argument of
> the wrong error code, and done. The sequence is great for showing
> how this could happen. The prose makes my head hurt tbh.

Also please remember that stable maintainers need to read all of that
if this is a bug fix (it is a bug fix!) :-) So: shortest possible legit
argument, no prose, and the sequence was awesome :-)

BR, Jarkko


* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:54   ` Jarkko Sakkinen
  2024-05-15 13:56     ` Jarkko Sakkinen
@ 2024-05-15 14:15     ` Dave Hansen
  2024-05-15 14:28       ` Jarkko Sakkinen
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2024-05-15 14:15 UTC (permalink / raw)
  To: Jarkko Sakkinen, Dmitrii Kuvaiskii, dave.hansen, kai.huang,
	haitao.huang, reinette.chatre, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

On 5/15/24 06:54, Jarkko Sakkinen wrote:
> I'd cut 90% of the description out and just make the argument of
> the wrong error code, and done. The sequence is great for showing
> how this could happen. The prose makes my head hurt tbh.

The changelog is too long, but not fatally so.  I'd much rather have a
super verbose description than something super sparse.

Would something like this make more sense to folks?

	Imagine an mmap()'d file. Two threads touch the same address at
	the same time and fault. Both allocate a physical page and race
	to install a PTE for that page. Only one will win the race. The
	loser frees its page, but still continues handling the fault as
	a success and returns VM_FAULT_NOPAGE from the fault handler.

	The same race can happen with SGX. But there's a bug: the loser
	in the SGX case steers into a failure path. The loser EREMOVE's the
	winner's EPC page, then returns SIGBUS, likely killing the app.

	Fix the SGX loser's behavior. Change the return code to
	VM_FAULT_NOPAGE to avoid SIGBUS and call sgx_free_epc_page()
	which avoids EREMOVE'ing the winner's page and only frees the
	page that the loser allocated.
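
The generic "lose the race gracefully" pattern from the first paragraph
can be sketched in user space. This is an illustration only: the model_*
names are hypothetical, malloc() stands in for allocating a physical
page, and a C11 compare-and-swap is swapped in for the locking the real
fault path uses to install a PTE. The point it shows is that the loser
frees only its own page and still reports the fault as handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-ins for the fault handler's return codes. */
enum model_fault_result { MODEL_FAULT_NOPAGE, MODEL_FAULT_SIGBUS };

/* One slot standing in for a PTE; NULL means "not present". */
struct model_pte {
    _Atomic(void *) page;
};

/*
 * Handle a fault on the address backed by 'pte'. '*won' reports whether
 * this caller installed its page or found one already installed.
 */
static enum model_fault_result model_handle_fault(struct model_pte *pte,
                                                  bool *won)
{
    void *mine;
    void *expected = NULL;

    assert(pte != NULL && won != NULL);

    mine = malloc(64);  /* stand-in for allocating a physical page */
    if (!mine) {
        *won = false;
        return MODEL_FAULT_SIGBUS;  /* a genuine failure */
    }

    /* Install our page only if nobody else has installed one yet. */
    *won = atomic_compare_exchange_strong(&pte->page, &expected, mine);
    if (!*won)
        free(mine);  /* lost the race: release only our own page */

    /* Either way the address is mapped now, so the fault is handled. */
    return MODEL_FAULT_NOPAGE;
}
```

Calling model_handle_fault() twice on the same slot plays the winner and
then the loser: both calls return MODEL_FAULT_NOPAGE, and the slot still
holds the winner's page afterwards.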




* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 14:15     ` Dave Hansen
@ 2024-05-15 14:28       ` Jarkko Sakkinen
  0 siblings, 0 replies; 13+ messages in thread
From: Jarkko Sakkinen @ 2024-05-15 14:28 UTC (permalink / raw)
  To: Dave Hansen, Dmitrii Kuvaiskii, dave.hansen, kai.huang,
	haitao.huang, reinette.chatre, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

On Wed May 15, 2024 at 5:15 PM EEST, Dave Hansen wrote:
> On 5/15/24 06:54, Jarkko Sakkinen wrote:
> > I'd cut 90% of the description out and just make the argument of
> > the wrong error code, and done. The sequence is great for showing
> > how this could happen. The prose makes my head hurt tbh.
>
> The changelog is too long, but not fatally so.  I'd much rather have a
> super verbose description than something super sparse.
>
> Would something like this make more sense to folks?
>
> 	Imagine an mmap()'d file. Two threads touch the same address at
> 	the same time and fault. Both allocate a physical page and race
> 	to install a PTE for that page. Only one will win the race. The
> 	loser frees its page, but still continues handling the fault as
> 	a success and returns VM_FAULT_NOPAGE from the fault handler.
>
> 	The same race can happen with SGX. But there's a bug: the loser
> 	in the SGX steers into a failure path. The loser EREMOVE's the
> 	winner's EPC page, then returns SIGBUS, likely killing the app.
>
> 	Fix the SGX loser's behavior. Change the return code to
> 	VM_FAULT_NOPAGE to avoid SIGBUS and call sgx_free_epc_page()
> 	which avoids EREMOVE'ing the winner's page and only frees the
> 	page that the loser allocated.

Yes!

I did read the whole thing. My comment was only related to the
chain of maintainers who also have to deal with this patch
eventually.

BR, Jarkko


* Re: [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
  2024-05-15 13:12 ` [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
@ 2024-05-15 14:44   ` Jarkko Sakkinen
  2024-05-15 15:52   ` Reinette Chatre
  2024-05-15 21:59   ` Haitao Huang
  2 siblings, 0 replies; 13+ messages in thread
From: Jarkko Sakkinen @ 2024-05-15 14:44 UTC (permalink / raw)
  To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
	reinette.chatre, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable

On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). The scenario proceeds as
> follows:
>
> /*
>  * CPU1: User space performs
>  * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
>  * on enclave page X
>  */
> sgx_encl_remove_pages() {
>
>   mutex_lock(&encl->lock);
>
>   entry = sgx_encl_load_page(encl);
>   /*
>    * verify that page is
>    * trimmed and accepted
>    */
>
>   mutex_unlock(&encl->lock);
>
>   /*
>    * remove PTE entry; cannot
>    * be performed under lock
>    */
>   sgx_zap_enclave_ptes(encl);
>                                  /*
>                                   * Fault on CPU2 on same page X
>                                   */
>                                  sgx_vma_fault() {
>                                    /*
>                                     * PTE entry was removed, but the
>                                     * page is still in enclave's xarray
>                                     */
>                                    xa_load(&encl->page_array) != NULL ->
>                                    /*
>                                     * SGX driver thinks that this page
>                                     * was swapped out and loads it
>                                     */
>                                    mutex_lock(&encl->lock);
>                                    /*
>                                     * this is effectively a no-op
>                                     */
>                                    entry = sgx_encl_load_page_in_vma();
>                                    /*
>                                     * add PTE entry
>                                     *
>                                     * *BUG*: a PTE is installed for a
>                                     * page in process of being removed
>                                     */
>                                    vmf_insert_pfn(...);
>
>                                    mutex_unlock(&encl->lock);
>                                    return VM_FAULT_NOPAGE;
>                                  }
>   /*
>    * continue with page removal
>    */
>   mutex_lock(&encl->lock);
>
>   sgx_encl_free_epc_page(epc_page) {
>     /*
>      * remove page via EREMOVE
>      */
>     /*
>      * free EPC page
>      */
>     sgx_free_epc_page(epc_page);
>   }
>
>   xa_erase(&encl->page_array);
>
>   mutex_unlock(&encl->lock);
> }
>
> Here, CPU1 removed the page. However, CPU2 installed the PTE entry on
> the same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
>
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
> default and set only right before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if so, CPU2 backs off and waits until the
> page is completely removed. After that, any memory access to this page
> results in a normal "allocate and EAUG a page on #PF" flow.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  
>  	/* Entry successfully located. */
>  	if (entry->epc_page) {
> -		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> +		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +				   SGX_ENCL_PAGE_BEING_REMOVED))
>  			return ERR_PTR(-EBUSY);
>  
>  		return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
> +
>  struct sgx_encl_page {
>  	unsigned long desc;
>  	unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>  		 * Do not keep encl->lock because of dependency on
>  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
>  		 */
> +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>  		mutex_unlock(&encl->lock);
>  
>  		sgx_zap_enclave_ptes(encl, addr);

Makes perfect sense:

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

* Re: [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
  2024-05-15 13:12 ` [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
  2024-05-15 14:44   ` Jarkko Sakkinen
@ 2024-05-15 15:52   ` Reinette Chatre
  2024-05-15 21:59   ` Haitao Huang
  2 siblings, 0 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-05-15 15:52 UTC (permalink / raw)
  To: Dmitrii Kuvaiskii, dave.hansen, jarkko, kai.huang, haitao.huang,
	linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable

Hi Dmitrii,

On 5/15/2024 6:12 AM, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). The scenario proceeds as
> follows:
> 
> /*
>  * CPU1: User space performs
>  * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
>  * on enclave page X
>  */
> sgx_encl_remove_pages() {
> 
>   mutex_lock(&encl->lock);
> 
>   entry = sgx_encl_load_page(encl);
>   /*
>    * verify that page is
>    * trimmed and accepted
>    */
> 
>   mutex_unlock(&encl->lock);
> 
>   /*
>    * remove PTE entry; cannot
>    * be performed under lock
>    */
>   sgx_zap_enclave_ptes(encl);
>                                  /*
>                                   * Fault on CPU2 on same page X
>                                   */
>                                  sgx_vma_fault() {
>                                    /*
>                                     * PTE entry was removed, but the
>                                     * page is still in enclave's xarray
>                                     */
>                                    xa_load(&encl->page_array) != NULL ->
>                                    /*
>                                     * SGX driver thinks that this page
>                                     * was swapped out and loads it
>                                     */
>                                    mutex_lock(&encl->lock);
>                                    /*
>                                     * this is effectively a no-op
>                                     */
>                                    entry = sgx_encl_load_page_in_vma();
>                                    /*
>                                     * add PTE entry
>                                     *
>                                     * *BUG*: a PTE is installed for a
>                                     * page in process of being removed
>                                     */
>                                    vmf_insert_pfn(...);
> 
>                                    mutex_unlock(&encl->lock);
>                                    return VM_FAULT_NOPAGE;
>                                  }
>   /*
>    * continue with page removal
>    */
>   mutex_lock(&encl->lock);
> 
>   sgx_encl_free_epc_page(epc_page) {
>     /*
>      * remove page via EREMOVE
>      */
>     /*
>      * free EPC page
>      */
>     sgx_free_epc_page(epc_page);
>   }
> 
>   xa_erase(&encl->page_array);
> 
>   mutex_unlock(&encl->lock);
> }
> 
> Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
> 
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
> default and set only right before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if yes then CPU2 backs off and waits until
> the page is completely removed. After that, any memory access to this
> page results in a normal "allocate and EAUG a page on #PF" flow.
> 
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  
>  	/* Entry successfully located. */
>  	if (entry->epc_page) {
> -		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> +		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +				   SGX_ENCL_PAGE_BEING_REMOVED))
>  			return ERR_PTR(-EBUSY);
>  
>  		return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
> +
>  struct sgx_encl_page {
>  	unsigned long desc;
>  	unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>  		 * Do not keep encl->lock because of dependency on
>  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
>  		 */
> +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>  		mutex_unlock(&encl->lock);
>  
>  		sgx_zap_enclave_ptes(encl, addr);

Thank you very much for tracking down and fixing this issue.

Acked-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
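
For readers following the thread, the back-off described in the changelog above can be modelled in user space. This is only a minimal sketch under stated assumptions: the toy_* names, the single-page enclave and the pthread mutex are hypothetical stand-ins, not the SGX driver's code. Only the two bit positions, BIT(2) and BIT(3), are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Bit positions mirror the patch; everything else is a simplified stand-in. */
#define TOY_PAGE_BEING_REMOVED   (1UL << 2)
#define TOY_PAGE_BEING_RECLAIMED (1UL << 3)

struct toy_page {
	unsigned long desc;	/* flag bits, like sgx_encl_page::desc */
	int has_epc;		/* nonzero while an EPC page is attached */
};

struct toy_encl {
	pthread_mutex_t lock;	/* stands in for encl->lock */
	struct toy_page page;	/* a single enclave page, for brevity */
};

/* Fault-side lookup; the caller holds the lock. Returns 0 or -EBUSY. */
static int toy_load_page(struct toy_encl *encl)
{
	struct toy_page *p = &encl->page;

	if (p->has_epc &&
	    (p->desc & (TOY_PAGE_BEING_RECLAIMED | TOY_PAGE_BEING_REMOVED)))
		return -EBUSY;	/* back off, the page is in transition */
	return 0;		/* the fault may proceed */
}

/* First half of removal: mark the page, then drop the lock. */
static void toy_remove_begin(struct toy_encl *encl)
{
	pthread_mutex_lock(&encl->lock);
	encl->page.desc |= TOY_PAGE_BEING_REMOVED;
	pthread_mutex_unlock(&encl->lock);
	/* PTE zapping would happen here, without the lock held. */
}

/* Second half of removal: retake the lock and drop the page. */
static void toy_remove_finish(struct toy_encl *encl)
{
	pthread_mutex_lock(&encl->lock);
	assert(encl->page.desc & TOY_PAGE_BEING_REMOVED);
	encl->page.has_epc = 0;
	encl->page.desc = 0;
	pthread_mutex_unlock(&encl->lock);
}

/* Fault handler: take the lock, look the page up, report the result. */
static int toy_fault(struct toy_encl *encl)
{
	int ret;

	pthread_mutex_lock(&encl->lock);
	ret = toy_load_page(encl);
	pthread_mutex_unlock(&encl->lock);
	return ret;
}
```

In this model a fault that arrives between toy_remove_begin() and toy_remove_finish() gets -EBUSY instead of installing a PTE. Once removal has finished the lookup no longer reports the page as busy, which corresponds to the normal "allocate and EAUG a page on #PF" flow in the changelog.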

* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:12 ` [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
  2024-05-15 13:54   ` Jarkko Sakkinen
@ 2024-05-15 15:58   ` Reinette Chatre
  2024-05-15 16:39     ` Jarkko Sakkinen
  2024-05-15 22:01   ` Haitao Huang
  2 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2024-05-15 15:58 UTC (permalink / raw)
  To: Dmitrii Kuvaiskii, dave.hansen, jarkko, kai.huang, haitao.huang,
	linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

Hi Dmitrii,

On 5/15/2024 6:12 AM, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to access the same non-present enclave page
> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> threads will end up in sgx_encl_eaug_page(), racing to acquire the
> enclave lock. The winning thread will perform EAUG, set up the page
> table entry, and insert the page into encl->page_array. The losing
> thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> to the error handling path.
> 
> This race condition can be illustrated as follows:
> 
> /*                             /*
>  * Fault on CPU1                * Fault on CPU2
>  * on enclave page X            * on enclave page X
>  */                             */
> sgx_vma_fault() {              sgx_vma_fault() {
> 
>   xa_load(&encl->page_array)     xa_load(&encl->page_array)
>       == NULL -->                    == NULL -->
> 
>   sgx_encl_eaug_page() {         sgx_encl_eaug_page() {
> 
>     ...                            ...
> 
>     /*                             /*
>      * alloc encl_page              * alloc encl_page
>      */                             */
>                                    mutex_lock(&encl->lock);
>                                    /*
>                                     * alloc EPC page
>                                     */
>                                    epc_page = sgx_alloc_epc_page(...);
>                                    /*
>                                     * add page to enclave's xarray
>                                     */
>                                    xa_insert(&encl->page_array, ...);
>                                    /*
>                                     * add page to enclave via EAUG
>                                     * (page is in pending state)
>                                     */
>                                    /*
>                                     * add PTE entry
>                                     */
>                                    vmf_insert_pfn(...);
> 
>                                    mutex_unlock(&encl->lock);
>                                    return VM_FAULT_NOPAGE;
>                                  }
>                                }
>                                /*
>                                 * All good up to here: enclave page
>                                 * successfully added to enclave,
>                                 * ready for EACCEPT from user space
>                                 */
>     mutex_lock(&encl->lock);
>     /*
>      * alloc EPC page
>      */
>     epc_page = sgx_alloc_epc_page(...);
>     /*
>      * add page to enclave's xarray,
>      * this fails with -EBUSY as this
>      * page was already added by CPU2
>      */
>     xa_insert(&encl->page_array, ...);
> 
>   err_out_shrink:
>     sgx_encl_free_epc_page(epc_page) {
>       /*
>        * remove page via EREMOVE
>        *
>        * *BUG*: page added by CPU2 is
>        * yanked from enclave while it
>        * remains accessible from OS
>        * perspective (PTE installed)
>        */
>       /*
>        * free EPC page
>        */
>       sgx_free_epc_page(epc_page);
>     }
> 
>     mutex_unlock(&encl->lock);
>     /*
>      * *BUG*: SIGBUS is returned
>      * for a valid enclave page
>      */
>     return VM_FAULT_SIGBUS;
>   }
> }
> 
> The err_out_shrink error handling path contains two bugs: (1) function
> sgx_encl_free_epc_page() is called that performs EREMOVE even though the
> enclave page was never intended to be removed, and (2) SIGBUS is sent to
> userspace even though the enclave page is correctly installed by another
> thread.
> 
> The first bug renders the enclave page perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page. The second bug is less severe: a spurious SIGBUS signal is
> unnecessarily sent to user space.
> 
> Fix these two bugs (1) by replacing sgx_encl_free_epc_page() with
> sgx_free_epc_page() so that no EREMOVE is performed, and (2) by
> returning VM_FAULT_NOPAGE to the generic Linux fault handler so that no
> signal is sent to userspace.
> 
> Note that sgx_encl_free_epc_page() performs an additional WARN_ON_ONCE
> check in comparison to sgx_free_epc_page(): whether the EPC page is
> tracked by the reclaimer. However, the EPC page is allocated in
> sgx_encl_eaug_page() and has zeroed-out flags in all error handling
> paths. In other words, the page is marked as reclaimable only in the
> happy path of sgx_encl_eaug_page(). Therefore, in the particular code
> path affected in this commit, the "page reclaimer tracked" condition is
> always false and the warning is never printed. Thus, it is safe to
> replace sgx_encl_free_epc_page() with sgx_free_epc_page().
> 
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> Cc: stable@vger.kernel.org
> Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	 * If ret == -EBUSY then page was created in another flow while
>  	 * running without encl->lock
>  	 */
> -	if (ret)
> +	if (ret) {
> +		if (ret == -EBUSY)
> +			vmret = VM_FAULT_NOPAGE;
>  		goto err_out_shrink;
> +	}
>  
>  	pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>  	pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  err_out_shrink:
>  	sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> -	sgx_encl_free_epc_page(epc_page);
> +	sgx_free_epc_page(epc_page);
>  err_out_unlock:
>  	mutex_unlock(&encl->lock);
>  	kfree(encl_page);

Thank you very much. I understand the changelog is still being discussed
and those changes look good to me, to which you can add:

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
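
A minimal user-space sketch of the losing-thread handling in the diff above may help readers who do not have the driver source at hand. All toy_* names are hypothetical, the one-slot struct stands in for encl->page_array, and locking is omitted because the two faults are simply run one after the other.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

enum toy_vmret { TOY_FAULT_NOPAGE, TOY_FAULT_SIGBUS };

struct toy_encl {
	void *slot;		/* one-entry stand-in for encl->page_array */
	int epc_in_use;		/* number of toy EPC pages currently allocated */
};

/* Like xa_insert(): refuses to overwrite an existing entry. */
static int toy_insert(struct toy_encl *encl, void *entry)
{
	if (encl->slot)
		return -EBUSY;
	encl->slot = entry;
	return 0;
}

/* Models the patched sgx_encl_eaug_page() for a single page. */
static enum toy_vmret toy_eaug_fault(struct toy_encl *encl)
{
	enum toy_vmret vmret = TOY_FAULT_SIGBUS;
	void *encl_page = malloc(1);
	int ret;

	if (!encl_page)
		return TOY_FAULT_SIGBUS;

	encl->epc_in_use++;		/* "alloc EPC page" */

	ret = toy_insert(encl, encl_page);
	if (ret) {
		/* Another fault won the race: not an error for user space. */
		if (ret == -EBUSY)
			vmret = TOY_FAULT_NOPAGE;
		goto err_out;
	}
	return TOY_FAULT_NOPAGE;	/* winner: page added, PTE installed */

err_out:
	/* Free only our own unused page; the winner's entry is untouched. */
	assert(encl->epc_in_use > 0);
	encl->epc_in_use--;
	free(encl_page);
	return vmret;
}
```

Calling toy_eaug_fault() twice on the same toy_encl models the race: the second call sees -EBUSY, releases only the page it allocated itself, and still reports TOY_FAULT_NOPAGE, so the faulting access is simply retried against the winner's page.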

* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 15:58   ` Reinette Chatre
@ 2024-05-15 16:39     ` Jarkko Sakkinen
  0 siblings, 0 replies; 13+ messages in thread
From: Jarkko Sakkinen @ 2024-05-15 16:39 UTC (permalink / raw)
  To: Reinette Chatre, Dmitrii Kuvaiskii, dave.hansen, kai.huang,
	haitao.huang, linux-sgx, linux-kernel
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

> Thank you very much. I understand the changelog is still being discussed
> and those changes look good to me, to which you can add:
>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

also for this (with changelog tweak Dave suggested) so that we don't
need a new round:

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

* Re: [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
  2024-05-15 13:12 ` [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
  2024-05-15 14:44   ` Jarkko Sakkinen
  2024-05-15 15:52   ` Reinette Chatre
@ 2024-05-15 21:59   ` Haitao Huang
  2 siblings, 0 replies; 13+ messages in thread
From: Haitao Huang @ 2024-05-15 21:59 UTC (permalink / raw)
  To: dave.hansen, jarkko, kai.huang, reinette.chatre, linux-sgx,
	linux-kernel, Dmitrii Kuvaiskii
  Cc: mona.vij, kailun.qin, stable

On Wed, 15 May 2024 08:12:40 -0500, Dmitrii Kuvaiskii  
<dmitrii.kuvaiskii@intel.com> wrote:

> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). The scenario proceeds as
> follows:
>
> /*
>  * CPU1: User space performs
>  * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
>  * on enclave page X
>  */
> sgx_encl_remove_pages() {
>
>   mutex_lock(&encl->lock);
>
>   entry = sgx_encl_load_page(encl);
>   /*
>    * verify that page is
>    * trimmed and accepted
>    */
>
>   mutex_unlock(&encl->lock);
>
>   /*
>    * remove PTE entry; cannot
>    * be performed under lock
>    */
>   sgx_zap_enclave_ptes(encl);
>                                  /*
>                                   * Fault on CPU2 on same page X
>                                   */
>                                  sgx_vma_fault() {
>                                    /*
>                                     * PTE entry was removed, but the
>                                     * page is still in enclave's xarray
>                                     */
>                                    xa_load(&encl->page_array) != NULL ->
>                                    /*
>                                     * SGX driver thinks that this page
>                                     * was swapped out and loads it
>                                     */
>                                    mutex_lock(&encl->lock);
>                                    /*
>                                     * this is effectively a no-op
>                                     */
>                                    entry = sgx_encl_load_page_in_vma();
>                                    /*
>                                     * add PTE entry
>                                     *
>                                     * *BUG*: a PTE is installed for a
>                                     * page in process of being removed
>                                     */
>                                    vmf_insert_pfn(...);
>
>                                    mutex_unlock(&encl->lock);
>                                    return VM_FAULT_NOPAGE;
>                                  }
>   /*
>    * continue with page removal
>    */
>   mutex_lock(&encl->lock);
>
>   sgx_encl_free_epc_page(epc_page) {
>     /*
>      * remove page via EREMOVE
>      */
>     /*
>      * free EPC page
>      */
>     sgx_free_epc_page(epc_page);
>   }
>
>   xa_erase(&encl->page_array);
>
>   mutex_unlock(&encl->lock);
> }
>
> Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
>
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
> default and set only right before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if yes then CPU2 backs off and waits until
> the page is completely removed. After that, any memory access to this
> page results in a normal "allocate and EAUG a page on #PF" flow.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  
>  	/* Entry successfully located. */
>  	if (entry->epc_page) {
> -		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> +		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +				   SGX_ENCL_PAGE_BEING_REMOVED))
>  			return ERR_PTR(-EBUSY);
>  
>  		return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
> +
>  struct sgx_encl_page {
>  	unsigned long desc;
>  	unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>  		 * Do not keep encl->lock because of dependency on
>  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
>  		 */
> +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>  		mutex_unlock(&encl->lock);
>  
>  		sgx_zap_enclave_ptes(encl, addr);



Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>

* Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  2024-05-15 13:12 ` [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
  2024-05-15 13:54   ` Jarkko Sakkinen
  2024-05-15 15:58   ` Reinette Chatre
@ 2024-05-15 22:01   ` Haitao Huang
  2 siblings, 0 replies; 13+ messages in thread
From: Haitao Huang @ 2024-05-15 22:01 UTC (permalink / raw)
  To: dave.hansen, jarkko, kai.huang, reinette.chatre, linux-sgx,
	linux-kernel, Dmitrii Kuvaiskii
  Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka

On Wed, 15 May 2024 08:12:39 -0500, Dmitrii Kuvaiskii  
<dmitrii.kuvaiskii@intel.com> wrote:

> Two enclave threads may try to access the same non-present enclave page
> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> threads will end up in sgx_encl_eaug_page(), racing to acquire the
> enclave lock. The winning thread will perform EAUG, set up the page
> table entry, and insert the page into encl->page_array. The losing
> thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> to the error handling path.
>
> This race condition can be illustrated as follows:
>
> /*                             /*
>  * Fault on CPU1                * Fault on CPU2
>  * on enclave page X            * on enclave page X
>  */                             */
> sgx_vma_fault() {              sgx_vma_fault() {
>
>   xa_load(&encl->page_array)     xa_load(&encl->page_array)
>       == NULL -->                    == NULL -->
>
>   sgx_encl_eaug_page() {         sgx_encl_eaug_page() {
>
>     ...                            ...
>
>     /*                             /*
>      * alloc encl_page              * alloc encl_page
>      */                             */
>                                    mutex_lock(&encl->lock);
>                                    /*
>                                     * alloc EPC page
>                                     */
>                                    epc_page = sgx_alloc_epc_page(...);
>                                    /*
>                                     * add page to enclave's xarray
>                                     */
>                                    xa_insert(&encl->page_array, ...);
>                                    /*
>                                     * add page to enclave via EAUG
>                                     * (page is in pending state)
>                                     */
>                                    /*
>                                     * add PTE entry
>                                     */
>                                    vmf_insert_pfn(...);
>
>                                    mutex_unlock(&encl->lock);
>                                    return VM_FAULT_NOPAGE;
>                                  }
>                                }
>                                /*
>                                 * All good up to here: enclave page
>                                 * successfully added to enclave,
>                                 * ready for EACCEPT from user space
>                                 */
>     mutex_lock(&encl->lock);
>     /*
>      * alloc EPC page
>      */
>     epc_page = sgx_alloc_epc_page(...);
>     /*
>      * add page to enclave's xarray,
>      * this fails with -EBUSY as this
>      * page was already added by CPU2
>      */
>     xa_insert(&encl->page_array, ...);
>
>   err_out_shrink:
>     sgx_encl_free_epc_page(epc_page) {
>       /*
>        * remove page via EREMOVE
>        *
>        * *BUG*: page added by CPU2 is
>        * yanked from enclave while it
>        * remains accessible from OS
>        * perspective (PTE installed)
>        */
>       /*
>        * free EPC page
>        */
>       sgx_free_epc_page(epc_page);
>     }
>
>     mutex_unlock(&encl->lock);
>     /*
>      * *BUG*: SIGBUS is returned
>      * for a valid enclave page
>      */
>     return VM_FAULT_SIGBUS;
>   }
> }
>
> The err_out_shrink error handling path contains two bugs: (1) function
> sgx_encl_free_epc_page() is called that performs EREMOVE even though the
> enclave page was never intended to be removed, and (2) SIGBUS is sent to
> userspace even though the enclave page is correctly installed by another
> thread.
>
> The first bug renders the enclave page perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page. The second bug is less severe: a spurious SIGBUS signal is
> unnecessarily sent to user space.
>
> Fix these two bugs (1) by replacing sgx_encl_free_epc_page() with
> sgx_free_epc_page() so that no EREMOVE is performed, and (2) by
> returning VM_FAULT_NOPAGE to the generic Linux fault handler so that no
> signal is sent to userspace.
>
> Note that sgx_encl_free_epc_page() performs an additional WARN_ON_ONCE
> check in comparison to sgx_free_epc_page(): whether the EPC page is
> tracked by the reclaimer. However, the EPC page is allocated in
> sgx_encl_eaug_page() and has zeroed-out flags in all error handling
> paths. In other words, the page is marked as reclaimable only in the
> happy path of sgx_encl_eaug_page(). Therefore, in the particular code
> path affected in this commit, the "page reclaimer tracked" condition is
> always false and the warning is never printed. Thus, it is safe to
> replace sgx_encl_free_epc_page() with sgx_free_epc_page().
>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> Cc: stable@vger.kernel.org
> Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	 * If ret == -EBUSY then page was created in another flow while
>  	 * running without encl->lock
>  	 */
> -	if (ret)
> +	if (ret) {
> +		if (ret == -EBUSY)
> +			vmret = VM_FAULT_NOPAGE;
>  		goto err_out_shrink;
> +	}
>  
>  	pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>  	pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  err_out_shrink:
>  	sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> -	sgx_encl_free_epc_page(epc_page);
> +	sgx_free_epc_page(epc_page);
>  err_out_unlock:
>  	mutex_unlock(&encl->lock);
>  	kfree(encl_page);

Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
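
The changelog's argument for swapping the free helper can be restated as a small model. This is a sketch that assumes only what the changelog says: the "encl" variant performs a reclaimer-tracked check and an EREMOVE before freeing, while the plain variant only frees. The toy_* names, the counters and the flag value are hypothetical and are not the driver's definitions.

```c
#include <assert.h>

#define TOY_RECLAIMER_TRACKED (1u << 0)	/* hypothetical flag bit */

struct toy_epc_page {
	unsigned int flags;
	int freed;
};

static int toy_warnings;	/* counts would-be WARN_ON_ONCE hits */
static int toy_eremove_calls;	/* counts would-be EREMOVE invocations */

/* Plain free: returns the page to the allocator and nothing else. */
static void toy_free_epc_page(struct toy_epc_page *p)
{
	p->freed = 1;
}

/* "encl" variant: check the flag, EREMOVE, then do the plain free. */
static void toy_encl_free_epc_page(struct toy_epc_page *p)
{
	if (p->flags & TOY_RECLAIMER_TRACKED)
		toy_warnings++;
	toy_eremove_calls++;
	toy_free_epc_page(p);
}

/*
 * Error path of the EAUG fault: per the changelog the page still has
 * zeroed flags here, so skipping the check loses no diagnostics and
 * skipping EREMOVE is the point of the fix.
 */
static void toy_error_path_free(struct toy_epc_page *p)
{
	assert(p->flags == 0);
	toy_free_epc_page(p);
}
```

With flags still zero, toy_error_path_free() frees the page without touching either counter, which is the behaviour the patch wants on the -EBUSY path.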
