Re: [PATCH v39 13/24] x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES

From: Dave Hansen <dave.hansen@intel.com>
To: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: x86@kernel.org, linux-sgx@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Jethro Beekman <jethro@fortanix.com>,
	Haitao Huang <haitao.huang@linux.intel.com>,
	Chunyang Hui <sanqian.hcy@antfin.com>,
	Jordan Hand <jorhand@linux.microsoft.com>,
	Nathaniel McCallum <npmccallum@redhat.com>,
	Seth Moore <sethmo@google.com>,
	Darren Kenny <darren.kenny@oracle.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	akpm@linux-foundation.org, andriy.shevchenko@linux.intel.com,
	asapek@google.com, bp@alien8.de, cedric.xing@intel.com,
	chenalexchen@google.com, conradparker@google.com,
	cyhanish@google.com, haitao.huang@intel.com, kai.huang@intel.com,
	kai.svahn@intel.com, kmoy@google.com, ludloff@google.com,
	luto@kernel.org, nhorman@redhat.com, puiterwijk@redhat.com,
	rientjes@google.com, tglx@linutronix.de, yaozhangx@google.com,
	mikko.ylinen@intel.com
Subject: Re: [PATCH v39 13/24] x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
Date: Mon, 19 Oct 2020 13:48:32 -0700	[thread overview]
Message-ID: <516a1b7a-38cc-adde-833b-b661cbee97f2@intel.com> (raw)
In-Reply-To: <20201018050311.GK68722@linux.intel.com>

On 10/17/20 10:03 PM, Jarkko Sakkinen wrote:
> On Fri, Oct 16, 2020 at 02:25:50PM -0700, Dave Hansen wrote:
>>> +/**
>>> + * struct sgx_enclave_add_pages - parameter structure for the
>>> + *                                %SGX_IOC_ENCLAVE_ADD_PAGE ioctl
>>> + * @src:	start address for the page data
>>> + * @offset:	starting page offset
>>
>> Is this the offset *within* the page?  Might be nice to say that.
> 
> It's the offset in the enclave address range where page is to be added.

Yikes, comment improvement needed, badly.

>>> +static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
>>> +						 unsigned long offset,
>>> +						 u64 secinfo_flags)
>>> +{
>>> +	struct sgx_encl_page *encl_page;
>>> +	unsigned long prot;
>>> +
>>> +	encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
>>> +	if (!encl_page)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	encl_page->desc = encl->base + offset;
>>> +	encl_page->encl = encl;
>>
>> Somewhere, we need an explanation of why we have 'sgx_epc_page' and
>> 'sgx_encl_page'.  I think they're 1:1 at least after
>> sgx_encl_page_alloc(), so I'm wondering why we need two.
> 
> You need sgx_encl_page to hold data that exists whether or not there is
> an associated EPC page.

Except they're currently tightly bound:

>         encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>         if (IS_ERR(encl_page))
>                 return PTR_ERR(encl_page);
>  
> -       epc_page = __sgx_alloc_epc_page();
> +       epc_page = sgx_alloc_epc_page(encl_page, true);
>         if (IS_ERR(epc_page)) {
>                 kfree(encl_page);
>                 return PTR_ERR(epc_page);
>         }

So, is this because 'sgx_encl_page' continues to exist even if
'sgx_epc_page' is reclaimed?

> Essentially sgx_encl_page contains the data needed for a virtual page,
> and sgx_epc_page what is needed to represent physical page.

So, grumble grumble, that's horribly inefficient for sparse mappings.
There's a reason VMAs cover ranges instead of being allocated
per-virtual-page.

> None of the data contained in sgx_encl_page make sense for sgx_epc_page.
> They don't contain intersecting or redundant data.

Yeah, except they point to each other, so if one isn't necessary, we can
get rid of that pointer.

>>> +static int __sgx_encl_add_page(struct sgx_encl *encl,
>>> +			       struct sgx_encl_page *encl_page,
>>> +			       struct sgx_epc_page *epc_page,
>>> +			       struct sgx_secinfo *secinfo, unsigned long src)
>>> +{
>>> +	struct sgx_pageinfo pginfo;
>>> +	struct vm_area_struct *vma;
>>> +	struct page *src_page;
>>> +	int ret;
>>> +
>>> +	/* Deny noexec. */
>>> +	vma = find_vma(current->mm, src);
>>> +	if (!vma)
>>> +		return -EFAULT;
>>> +
>>> +	if (!(vma->vm_flags & VM_MAYEXEC))
>>> +		return -EACCES;
>>> +
>>> +	ret = get_user_pages(src, 1, 0, &src_page, NULL);
>>> +	if (ret < 1)
>>> +		return -EFAULT;
>>> +
>>> +	pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
>>> +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
>>> +	pginfo.metadata = (unsigned long)secinfo;
>>> +	pginfo.contents = (unsigned long)kmap_atomic(src_page);
>>> +
>>> +	ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
>>
>> Could you convince me that EADD is not going to fault and make the
>> kmap_atomic() mad?
> 
> It can legitly fail in the case when power cycle happens.
> 
> That's why the inline assembly catches faults and return an error code.
> Thhis code has been field tested a lot. I have fairly good trust on
> it.

OK, so it can fault, but not *sleep*.

Can you comment it to that effect, please?

>>> +		if (ret) {
>>> +			if (encls_failed(ret))
>>> +				ENCLS_WARN(ret, "EEXTEND");
>>> +			return -EIO;
>>
>> How frequent should we expect these to be?  Can users cause them?  You
>> should *proably* call it ENCLS_WARN_ONCE() if it's implemented that way.
> 
> If power cycle happens.

So, we get one warning per power cycle?  Practically, do you mean a
suspend/resume cycle, or is this more like hibernation-to-disk-resume?

In any case, if this is normal system operation (which closing my laptop
lid qualifies as), it should produce zero warnings.

>>> +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>>> +			     unsigned long offset, struct sgx_secinfo *secinfo,
>>> +			     unsigned long flags)
>>> +{
>>> +	struct sgx_encl_page *encl_page;
>>> +	struct sgx_epc_page *epc_page;
>>> +	int ret;
>>> +
>>> +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>>> +	if (IS_ERR(encl_page))
>>> +		return PTR_ERR(encl_page);
>>> +
>>> +	epc_page = __sgx_alloc_epc_page();
>>> +	if (IS_ERR(epc_page)) {
>>> +		kfree(encl_page);
>>> +		return PTR_ERR(epc_page);
>>> +	}
>>
>> Looking at these, I'm forgetting why we need to both allocate an
>> encl_page and an epc_page.  Commends might remind me.  So would better
>> names.
> 
> Should the struct names be renamed?
> 
> Like sgx_phys_epc_page and sgx_virt_epc_page?

"epc" is additional acronym nonsense and redundant with "sgx" and "page"
anyway.

I'd probably call then 'sgx_phys_page' and 'sgx_virt_slot' or something.

>>> +	mmap_read_lock(current->mm);
>>> +	mutex_lock(&encl->lock);
>>> +
>>> +	/*
>>> +	 * Insert prior to EADD in case of OOM.
>>
>> I wouldn't say OOM.  Maybe:
>>
>> 	xa_insert() and EADD can both fail.  But xa_insert() is easier
>> 	to unwind so do it first.
>>
>>>                                              EADD modifies MRENCLAVE, i.e.
>>
>> What is MRENCLAVE?
> 
> The measurement stored in SECS. I'm wondering  with xarray, is it
> possible to preallocate entry without inserting anything?

Let's use plain english here.  I don't care what the implementation
does, I just care about what it means to the kernel.

> Then we could get rid of this unwind and also would not need to
> take encl->lock in sgx_encl_may_map().

There was for radix trees, iirc.

>>> +	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
>>> +	 * to userspace errors (or kernel/hardware bugs).
>>> +	 */
>>> +	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
>>> +			encl_page, GFP_KERNEL);
>>> +	if (ret)
>>> +		goto err_out_unlock;
>>> +
>>> +	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
>>> +				  src);
>>> +	if (ret)
>>> +		goto err_out;
>>> +
>>> +	/*
>>> +	 * Complete the "add" before doing the "extend" so that the "add"
>>> +	 * isn't in a half-baked state in the extremely unlikely scenario
>>> +	 * the enclave will be destroyed in response to EEXTEND failure.
>>> +	 */
>>> +	encl_page->encl = encl;
>>> +	encl_page->epc_page = epc_page;
>>> +	encl->secs_child_cnt++;
>>> +
>>> +	if (flags & SGX_PAGE_MEASURE) {
>>> +		ret = __sgx_encl_extend(encl, epc_page);
>>> +		if (ret)
>>> +			goto err_out;
>>> +	}
>>
>> Why would we never *not* measure an added page?
> 
> You might add Thread Control Structure pages without measuring them or
> data area. There are reasons for the user space not to have everything
> measured.

This is also good comment fodder.

>>> +static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
>>> +{
>>> +	struct sgx_enclave_add_pages addp;
>>> +	struct sgx_secinfo secinfo;
>>> +	unsigned long c;
>>> +	int ret;
>>> +
>>> +	if ((atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) ||
>>> +	    !(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
>>> +		return -EINVAL;
>>
>> There should to be a nice state machine documented somewhere.  Is ther?
> 
> So should I document to encl.h where they are declared to start with?

I think it's better placed in the Documentation/.

>>> +	if (copy_from_user(&addp, arg, sizeof(addp)))
>>> +		return -EFAULT;
>>> +
>>> +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
>>> +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
>>> +		return -EINVAL;
>>> +
>>> +	if (!(access_ok(addp.src, PAGE_SIZE)))
>>> +		return -EFAULT;
>>
>> This worries me.  You're doing an access_ok() check on addp.src because
>> you evidently don't trust it.  But, below, it looks to be accessed
>> directly with an offset, bound by addp.length, which I think can be
>>> PAGE_SIZE.
>>
>> I'd feel a lot better if addp.src's value was being passed around as a
>> __user pointer.
> 
> I'm not sure if that call is even needed. Each page is pinned with
> get_user_pages(). AFAIK, it should be enough. This must be legacy cruft.

get_user_pages() and access_ok() do *very* different things.  Even if
the pages are pinned, you might still be tricked into referencing off
the end of the page, or up into the kernel address space.

>>> +	if (addp.length & (PAGE_SIZE - 1))
>>> +		return -EINVAL;
>>> +
>>> +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
>>> +		return -EINVAL;
>>> +
>>> +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
>>> +			   sizeof(secinfo)))
>>> +		return -EFAULT;
>>> +
>>> +	if (sgx_validate_secinfo(&secinfo))
>>> +		return -EINVAL;
>>> +
>>> +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
>>> +		if (signal_pending(current)) {
>>> +			if (!c)
>>> +				ret = -ERESTARTSYS;
>>> +
>>> +			break;
>>> +		}
>>> +
>>> +		if (c == SGX_MAX_ADD_PAGES_LENGTH)
>>> +			break;
>>> +
>>> +		if (need_resched())
>>> +			cond_resched();
>>> +
>>> +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
>>> +					&secinfo, addp.flags);
>>
>> Yeah...  Don't we need to do another access_ok() check here, if we
>> needed one above since we are moving away from addrp.src?
> 
> I don't think so because the page is pinned with get_user_pages().

No, get_user_pages() is orthogonal.

Looking at this again, you _might_ be OK since you validated addp.length
against encl->size.  But, it's all very convoluted and doesn't look very
organized or obviously right.

So, this begs the question: What, exactly are the guarantees you are
expecting out of get_user_pages() here?

I also think it's an absolute requirement that if you're passing around
userspace pointers that you tag them as __user, not pass around as
unsigned longs.