Re: [PATCH v25 07/21] x86/sgx: Enumerate and track EPC sections

From: Sean Christopherson <sean.j.christopherson@intel.com>
To: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
	linux-sgx@vger.kernel.org, akpm@linux-foundation.org,
	dave.hansen@intel.com, nhorman@redhat.com, npmccallum@redhat.com,
	haitao.huang@intel.com, andriy.shevchenko@linux.intel.com,
	tglx@linutronix.de, kai.svahn@intel.com, bp@alien8.de,
	josh@joshtriplett.org, luto@kernel.org, kai.huang@intel.com,
	rientjes@google.com, cedric.xing@intel.com,
	puiterwijk@redhat.com, Serge Ayoun <serge.ayoun@intel.com>
Subject: Re: [PATCH v25 07/21] x86/sgx: Enumerate and track EPC sections
Date: Wed, 5 Feb 2020 11:57:00 -0800	[thread overview]
Message-ID: <20200205195700.GJ4877@linux.intel.com> (raw)
In-Reply-To: <20200204060545.31729-8-jarkko.sakkinen@linux.intel.com>

On Tue, Feb 04, 2020 at 08:05:31AM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Enumerate Enclave Page Cache (EPC) sections via CPUID and add the data
> structures necessary to track EPC pages so that they can be allocated,
> freed and managed. As a system may have multiple EPC sections, invoke CPUID
> on SGX sub-leafs until an invalid leaf is encountered.
> 
> For simplicity, support a maximum of eight EPC sections. Existing client
> hardware supports only a single section, while upcoming server hardware
> will support at most eight sections. Bounding the number of sections also
> allows the section ID to be embedded along with a page's offset in a single
> unsigned long, enabling easy retrieval of both the VA and PA for a given
> page.

...

> +++ b/arch/x86/kernel/cpu/sgx/reclaim.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +// Copyright(c) 2016-19 Intel Corporation.
> +
> +#include <linux/freezer.h>
> +#include <linux/highmem.h>
> +#include <linux/kthread.h>
> +#include <linux/pagemap.h>
> +#include <linux/ratelimit.h>
> +#include <linux/slab.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include "encls.h"
> +
> +struct task_struct *ksgxswapd_tsk;
> +
> +/*
> + * Reset all pages to uninitialized state. Pages could be in initialized on
> + * kmemexec.
> + */
> +static void sgx_sanitize_section(struct sgx_epc_section *section)
> +{
> +	struct sgx_epc_page *page, *tmp;
> +	LIST_HEAD(secs_list);
> +	int ret;
> +
> +	while (!list_empty(&section->unsanitized_page_list)) {
> +		if (kthread_should_stop())
> +			return;
> +
> +		spin_lock(&section->lock);
> +
> +		page = list_first_entry(&section->unsanitized_page_list,
> +					struct sgx_epc_page, list);
> +
> +		ret = __eremove(sgx_epc_addr(page));
> +		if (!ret)
> +			list_move(&page->list, &section->page_list);
> +		else
> +			list_move_tail(&page->list, &secs_list);
> +
> +		spin_unlock(&section->lock);
> +
> +		cond_resched();
> +	}
> +
> +	list_for_each_entry_safe(page, tmp, &secs_list, list) {
> +		if (kthread_should_stop())
> +			return;
> +
> +		ret = __eremove(sgx_epc_addr(page));
> +		if (!WARN_ON_ONCE(ret)) {

Sadly, this WARN can fire after kexec() on systems with multiple EPC
sections if the SECS has child pages in another section.

Virtual EPC (KVM) has a similar issue, which I'm handling by collecting all
pages that fail a second EREMOVE in a global list, and then retrying every
page in the global list every time a virtual EPC is destroyed, i.e. might
have freed pages.

The same approach should work here, e.g. retry all SECS pages a third time
once all sections have been sanitized and WARN if EREMOVE fails a third
time on a page.

> +			spin_lock(&section->lock);
> +			list_move(&page->list, &section->page_list);
> +			spin_unlock(&section->lock);
> +		} else {
> +			list_del(&page->list);
> +			kfree(page);
> +		}
> +
> +		cond_resched();
> +	}
> +}
> +
> +static int ksgxswapd(void *p)
> +{
> +	int i;
> +
> +	set_freezable();
> +
> +	for (i = 0; i < sgx_nr_epc_sections; i++)
> +		sgx_sanitize_section(&sgx_epc_sections[i]);

I'm having second thoughts about proactively sanitizing the EPC sections.
I think it'd be better to do EREMOVE the first time a page is allocated,
e.g. add a SGX_EPC_PAGE_UNSANITIZED flag.

  1. Sanitizing EPC that's never used is a waste of cycles, especially on
     platforms with 64gb+ of EPC.

  2. Deferring to allocation time automatically scales with the number of
     tasks that are allocating EPC.  And, the CPU time will also be
     accounted to those tasks.

  3. Breaks on-demand paging when running in a VM, e.g. if the VMM chooses
     to allocate a physical EPC page when it's actually accessed by the
     VM.  I don't expect this to be a problem any time soon, as all VMMs
     will likely preallocate EPC pages until KVM (or any other hypervisor)
     gains EPC oversusbscription support, which may or may not ever happen.
     But, I'd prefer to simply not have the problem in the first place.

The logic will be a bit more complicated, but not terribly so.  

> +
> +	return 0;
> +}