From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: linux-sgx@vger.kernel.org
Subject: Re: [PATCH for_v21 1/2] x86/sgx: Use SRCU to protect mm_list during reclaim
Date: Fri, 12 Jul 2019 00:13:07 +0300
Message-ID: <20190711211307.cymfaxeuq73v63xg@linux.intel.com>
In-Reply-To: <20190711161625.15786-2-sean.j.christopherson@intel.com>

On Thu, Jul 11, 2019 at 09:16:24AM -0700, Sean Christopherson wrote:
> Reclaiming enclaves faces a bit of a conundrum when it comes to lock
> ordering.  The reclaim flows need to take mmap_sem for read, e.g. to age
> and zap PTEs on arbitrary mm_structs.  But reclaim must first walk the
> enclave's list of mm_structs, which could be modified asynchronously to
> reclaim.  Because modifying the list of mm_structs is done in reaction
> to vma changes, i.e. with mmap_sem held exclusively, taking enclave's
> mm_lock to protect the list walk in reclaim would lead to deadlocks due
> to conflicting lock ordering.  To avoid this, SGX currently uses a
> custom walker that drops mm_lock and restarts the walk as needed.

+1
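
The conflicting order, spelled out (my summary, not code from the patch):

	vma open/close:  down_write(&mmap_sem), then spin_lock(&encl->mm_lock)
	reclaim:         spin_lock(&encl->mm_lock), then down_read(&mmap_sem)

i.e. the classic ABBA pattern, and the reclaim side would additionally be
taking a sleeping lock inside a spinlock, which is not allowed to begin
with.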

> Use SRCU to protect reclaim instead of using a custom walker to avoid
> the aforementioned lock issues.  Using SRCU improves readability in the
> reclaimer by eliminating the need to juggle mm_lock during reclaim since
> it can take mmap_sem underneath srcu_read_lock().  And since reclaim
> doesn't drop its SRCU read lock, there is no need to grab a reference to
> encl_mm.

Speaking of a "lock issue" would to me imply an actual regression. I do
agree that SRCU is the right step forward.
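
To make sure I read this right, the reclaim side should now boil down to
something like this (a rough sketch based on the description above, not
the actual reclaim.c hunk):

	struct sgx_encl_mm *encl_mm;
	int idx;

	idx = srcu_read_lock(&encl->srcu);

	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
		/* Pin mm_users so the page tables cannot be torn down. */
		if (!mmget_not_zero(encl_mm->mm))
			continue;	/* encl_mm is dying, skip it */

		down_read(&encl_mm->mm->mmap_sem);
		/* ... age and zap PTEs ... */
		up_read(&encl_mm->mm->mmap_sem);

		/* async, as a plain mmput() could deadlock (see below) */
		mmput_async(encl_mm->mm);
	}

	srcu_read_unlock(&encl->srcu, idx);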

> Not taking a reference to encl_mm is not just an optimization, it's also
> functionally necessary and a major motivation for moving to SRCU.  Putting
> the reference can invoke sgx_encl_mm_release(), which calls
> synchronize_srcu() and will deadlock if done while holding the SRCU read
> lock.  Not taking a reference paves the way for additional refcounting
> improvements that would be extremely difficult to implement when using
> the custom walker due to cyclical dependencies on the refcount.

I'm not sure I get this. The existing code does not have a call to
synchronize_srcu().
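
If you mean the new code, then I suppose the hazard is a self-deadlock
along these lines (my sketch):

	srcu_read_lock(&encl->srcu);
	...
	kref_put(&encl_mm->refcount, sgx_encl_mm_release);
	  -> sgx_encl_mm_release()
	       -> synchronize_srcu(&encl->srcu)	/* waits for our own reader */

i.e. dropping the reference from inside the read-side critical section
could never complete.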

> Speaking of sgx_encl_mm_release(), the whole purpose of using SRCU is
> that sgx_encl_mm_release() is blocked (if called on another cpu) by
> synchronize_srcu(), which in turn prevents mmdrop() from freeing the
> mm_struct while reclaim is in the SRCU critical section.  Ultimately,
> reclaim just needs to ensure mm_struct isn't freed so that it can call
> mmget_not_zero() to prevent the page tables from being dropped while it
> accesses PTEs, i.e. it doesn't matter if the encl_mm is dying, reclaim
> just needs to make sure it's not fully dead.

+1
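
Worth spelling out that two different counters are in play here:

	mmgrab(mm);		/* pins mm_count: the mm_struct itself stays */
	mmget_not_zero(mm);	/* pins mm_users: the page tables stay */

SRCU plus the deferred mmdrop() covers the former, and the reclaimer
handles the latter itself.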

> To avoid calling synchronize_srcu() while holding srcu_read_lock(), use
> mmput_async() in the reclaimer, e.g. __mmput() closes all VMAs, thus
> triggering sgx_encl_mm_release() and synchronize_srcu().  Alternatively,
> sgx_encl_mm_release() could always call synchronize_srcu() in a worker
> thread (see below), but doing __mmput() in a worker thread is desirable
> from an SGX performance perspective, i.e. doesn't stall the reclaimer
> CPU to release the mm.

+1
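
For reference, the chain a plain mmput() could trigger from the reclaimer,
presumably via sgx_vma_close():

	mmput()
	  __mmput()
	    exit_mmap()		/* closes all VMAs */
	      sgx_vma_close()
	        kref_put(..., sgx_encl_mm_release)
	          synchronize_srcu()	/* under srcu_read_lock() -> deadlock */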

> 
> And finally, the last deadlock scenario is if sgx_encl_mm_release() is
> invoked on an in-use mm_struct, e.g. via munmap().
> 
> CPU0                     CPU1
> munmap()
> down_write(&mmap_sem)
>                          srcu_read_lock()
> 
> synchronize_srcu()
>                          down_read(&mmap_sem) <- deadlock
> 
> Avoid deadlock in this scenario by synchronizing SRCU via a worker
> thread.  SRCU ensures only the liveness of the mm_struct itself,
> which is guaranteed by an mmgrab() prior to scheduling the work.
> The reclaimer is responsible for checking mm_users and the VMAs to
> ensure it doesn't touch stale PTEs, i.e. delaying synchronization does
> not affect the reclaimer's responsibilities.  The delay does add one new
> wrinkle in that sgx_encl_mm_add() and sgx_vma_open() can see a dying
> encl_mm.  Previously this was prevented by virtue of sgx_vma_close()
> being mutually exclusive with them (the caller must hold
> down_write(&mmap_sem)).  Handle such a case by using
> kref_get_unless_zero().
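
My expectation of what that looks like, as a sketch only (the hunk is not
quoted below):

	/* in sgx_encl_mm_add() and sgx_vma_open() */
	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
		if (encl_mm->mm != mm)
			continue;
		if (!kref_get_unless_zero(&encl_mm->refcount))
			break;	/* dying encl_mm, treat it as already gone */
		/* ... reuse the existing encl_mm ... */
	}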
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/kernel/cpu/sgx/driver/main.c |  34 ++----
>  arch/x86/kernel/cpu/sgx/encl.c        | 165 ++++++++++++++------------
>  arch/x86/kernel/cpu/sgx/encl.h        |   9 +-
>  arch/x86/kernel/cpu/sgx/reclaim.c     |  71 ++++-------
>  5 files changed, 124 insertions(+), 156 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a0fd17c32521..17558cf48a8a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1918,6 +1918,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
>  config INTEL_SGX
>  	bool "Intel SGX core functionality"
>  	depends on X86_64 && CPU_SUP_INTEL
> +	select SRCU
>  	---help---
>  	  Intel(R) SGX is a set of CPU instructions that can be used by
>  	  applications to set aside private regions of code and data, referred
> diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
> index c7fc32e26105..27076754f7d8 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> @@ -25,6 +25,7 @@ u32 sgx_xsave_size_tbl[64];
>  static int sgx_open(struct inode *inode, struct file *file)
>  {
>  	struct sgx_encl *encl;
> +	int ret;
>  
>  	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
>  	if (!encl)
> @@ -38,6 +39,12 @@ static int sgx_open(struct inode *inode, struct file *file)
>  	INIT_LIST_HEAD(&encl->mm_list);
>  	spin_lock_init(&encl->mm_lock);
>  
> +	ret = init_srcu_struct(&encl->srcu);
> +	if (ret) {
> +		kfree(encl);
> +		return ret;
> +	}
> +
>  	file->private_data = encl;
>  
>  	return 0;
> @@ -65,25 +72,6 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
>  }
>  #endif
>  
> -static int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> -{
> -	struct sgx_encl_mm *encl_mm;
> -
> -	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
> -	if (!encl_mm)
> -		return -ENOMEM;
> -
> -	encl_mm->encl = encl;
> -	encl_mm->mm = mm;
> -	kref_init(&encl_mm->refcount);
> -
> -	spin_lock(&encl->mm_lock);
> -	list_add(&encl_mm->list, &encl->mm_list);
> -	spin_unlock(&encl->mm_lock);
> -
> -	return 0;
> -}
> -
>  /**
>   * sgx_calc_vma_prot() - Calculate VMA prot bits
>   * @encl:	an enclave
> @@ -129,11 +117,9 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~vm_prot_bits)
>  		return -EACCES;
>  
> -	if (!sgx_encl_get_mm(encl, vma->vm_mm)) {
> -		ret = sgx_encl_mm_add(encl, vma->vm_mm);
> -		if (ret)
> -			return ret;
> -	}
> +	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> +	if (ret)
> +		return ret;
>  
>  	if (!(vm_prot_bits & VM_READ))
>  		vma->vm_flags &= ~VM_MAYREAD;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 853ea8ef3ada..64ae7d705eb1 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -132,62 +132,116 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>  	return entry;
>  }
>  
> -void sgx_encl_mm_release(struct kref *ref)
> +static void sgx_encl_mm_release_deferred(struct work_struct *work)
> +{
> +	struct sgx_encl_mm *encl_mm =
> +		container_of(work, struct sgx_encl_mm, release_work);
> +
> +	synchronize_srcu(&encl_mm->encl->srcu);
> +	mmdrop(encl_mm->mm);
> +	kfree(encl_mm);
> +}
> +
> +static void sgx_encl_mm_release(struct kref *ref)
>  {
>  	struct sgx_encl_mm *encl_mm =
>  		container_of(ref, struct sgx_encl_mm, refcount);
>  
>  	spin_lock(&encl_mm->encl->mm_lock);
> -	list_del(&encl_mm->list);
> +	list_del_rcu(&encl_mm->list);
>  	spin_unlock(&encl_mm->encl->mm_lock);
>  
> -	kfree(encl_mm);
> +	/*
> +	 * If the mm has users, then do SRCU synchronization in a worker thread
> +	 * to avoid a deadlock with the reclaimer.  The caller holds mmap_sem
> +	 * for write, while the reclaimer will acquire mmap_sem for read if it
> +	 * is reclaiming from this enclave.  Invoking synchronize_srcu() here
> +	 * will hang waiting for the reclaimer to drop its RCU read lock, while
> +	 * the reclaimer will get stuck waiting to acquire mmap_sem.  The async
> +	 * shenanigans can be avoided if there are no mm users as the reclaimer
> +	 * will not acquire mmap_sem in that case.
> +	 */
> +	if (atomic_read(&encl_mm->mm->mm_users)) {
> +		mmgrab(encl_mm->mm);
> +		INIT_WORK(&encl_mm->release_work, sgx_encl_mm_release_deferred);
> +		schedule_work(&encl_mm->release_work);
> +	} else {
> +		synchronize_srcu(&encl_mm->encl->srcu);
> +		kfree(encl_mm);
> +	}
>  }
>  
> -struct sgx_encl_mm *sgx_encl_get_mm(struct sgx_encl *encl,
> -				    struct mm_struct *mm)
> +static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
> +					    struct mm_struct *mm)
>  {
>  	struct sgx_encl_mm *encl_mm = NULL;
> -	struct sgx_encl_mm *prev_mm = NULL;
> -	int iter;
> +	struct sgx_encl_mm *tmp;
> +	int idx;
>  
> -	while (true) {
> -		encl_mm = sgx_encl_next_mm(encl, prev_mm, &iter);
> -		if (prev_mm)
> -			kref_put(&prev_mm->refcount, sgx_encl_mm_release);
> -		prev_mm = encl_mm;
> +	idx = srcu_read_lock(&encl->srcu);
>  
> -		if (iter == SGX_ENCL_MM_ITER_DONE)
> +	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
> +		if (tmp->mm == mm) {
> +			encl_mm = tmp;
>  			break;
> -
> -		if (iter == SGX_ENCL_MM_ITER_RESTART)
> -			continue;
> -
> -		if (mm == encl_mm->mm)
> -			return encl_mm;
> +		}
>  	}
>  
> -	return NULL;
> +	srcu_read_unlock(&encl->srcu, idx);
> +
> +	return encl_mm;
>  }
>  
> -
> -static void sgx_vma_open(struct vm_area_struct *vma)
> +int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  {
> -	struct sgx_encl *encl = vma->vm_private_data;
> +	struct sgx_encl_mm *encl_mm;
>  
> -	if (!encl)
> -		return;
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);

Just a question: what does this assert check? (It's past midnight here and
I'm too tired to look it up, apologies.)

Anyway, no blocking issues. Thank you.

Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

/Jarkko

