From: Dave Hansen <dave.hansen@intel.com>
To: Jarkko Sakkinen <jarkko@kernel.org>, x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Jiri Kosina <trivial@kernel.org>
Subject: Re: [PATCH RFC] x86/sgx: Add trivial NUMA allocation
Date: Tue, 12 Jan 2021 16:24:01 -0800
Message-ID: <34b1acd1-e769-0dc2-a225-8ce3d2b6a085@intel.com>
In-Reply-To: <20201216135031.21518-1-jarkko@kernel.org>
On 12/16/20 5:50 AM, Jarkko Sakkinen wrote:
> Create a pointer array for each NUMA node with references to the
> contained EPC sections. Use this in __sgx_alloc_epc_page() to try the
> current NUMA node before the others.
It makes it harder to comment when I'm not on cc.
Hint, hint... ;)
We need a bit more information here as well. What's the relationship
between NUMA nodes and sections? How does the BIOS tell us which NUMA
nodes a section is in? Is it the same or different from normal RAM and
PMEM?
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index c519fc5f6948..0da510763c47 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -13,6 +13,13 @@
> #include "encl.h"
> #include "encls.h"
>
> +struct sgx_numa_node {
> + struct sgx_epc_section *sections[SGX_MAX_EPC_SECTIONS];
> + int nr_sections;
> +};
So, we have a 'NUMA node' structure already: pg_data_t. Why don't we
just hang the epc sections off there?
> +static struct sgx_numa_node sgx_numa_nodes[MAX_NUMNODES];
Hmm... Time to see if I can still do math.
#define SGX_MAX_EPC_SECTIONS 8
sizeof(struct sgx_numa_node) = 8 * sizeof(struct sgx_epc_section *)
                             + sizeof(int), padded out to 72 bytes
CONFIG_NODES_SHIFT=10 (on Fedora)
#define MAX_NUMNODES (1 << NODES_SHIFT)
72*1024 = ~72k. Yikes. For *EVERY* system that enables SGX,
regardless of whether it is NUMA or not.
Trivial is great, but this may be too trivial.
Adding a list_head to pg_data_t and sgx_epc_section would add
SGX_MAX_EPC_SECTIONS * sizeof(struct list_head) = 8*16 = 128 bytes for
the sections, plus 16 bytes per *present* NUMA node.
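Roughly, and just as an untested sketch (the 'numa_list' and
'epc_sections' names here are made up, not anything in the tree today):

	/* in struct sgx_epc_section: */
	struct list_head numa_list;	/* entry in the node's EPC list */

	/* in pg_data_t: */
	struct list_head epc_sections;	/* EPC sections on this node */

	/* at section init, once the node is known: */
	list_add_tail(&section->numa_list,
		      &NODE_DATA(nid)->epc_sections);

That only costs space for nodes which are actually present.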
> /**
> * __sgx_alloc_epc_page() - Allocate an EPC page
> *
> @@ -485,14 +511,19 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
> */
> struct sgx_epc_page *__sgx_alloc_epc_page(void)
> {
> - struct sgx_epc_section *section;
> struct sgx_epc_page *page;
> + int nid = numa_node_id();
> int i;
>
> - for (i = 0; i < sgx_nr_epc_sections; i++) {
> - section = &sgx_epc_sections[i];
> + page = __sgx_alloc_epc_page_from_node(nid);
> + if (page)
> + return page;
>
> - page = __sgx_alloc_epc_page_from_section(section);
> + for (i = 0; i < sgx_nr_numa_nodes; i++) {
> + if (i == nid)
> + continue;
Yikes. That's a horribly inefficient loop. Consider if nodes 0 and
1023 were the only ones with EPC. What would this loop do? I think
it's a much better idea to keep a nodemask_t of nodes that have EPC.
Then, just do bitmap searches.
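Something like this, as an untested sketch ('sgx_numa_mask' is a name
I'm making up):

	/* nodes that have at least one EPC section: */
	static nodemask_t sgx_numa_mask;

	struct sgx_epc_page *__sgx_alloc_epc_page(void)
	{
		struct sgx_epc_page *page;
		int nid = numa_node_id();
		int i;

		page = __sgx_alloc_epc_page_from_node(nid);
		if (page)
			return page;

		/* only visit nodes that actually have EPC: */
		for_each_node_mask(i, sgx_numa_mask) {
			if (i == nid)
				continue;

			page = __sgx_alloc_epc_page_from_node(i);
			if (page)
				return page;
		}

		return ERR_PTR(-ENOMEM);
	}

If only nodes 0 and 1023 have EPC, that loop now makes two trips
instead of a thousand.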
> + page = __sgx_alloc_epc_page_from_node(i);
> if (page)
> return page;
> }
> @@ -661,11 +692,28 @@ static inline u64 __init sgx_calc_section_metric(u64 low, u64 high)
> ((high & GENMASK_ULL(19, 0)) << 32);
> }
>
> +static int __init sgx_pfn_to_nid(unsigned long pfn)
> +{
> + pg_data_t *pgdat;
> + int nid;
> +
> + for (nid = 0; nid < nr_node_ids; nid++) {
> + pgdat = NODE_DATA(nid);
> +
> + if (pfn >= pgdat->node_start_pfn &&
> + pfn < (pgdat->node_start_pfn + pgdat->node_spanned_pages))
> + return nid;
> + }
> +
> + return 0;
> +}
I'm not positive this works. I *thought* ->node_start_pfn and
->node_spanned_pages were only guaranteed to cover memory which is
managed by the kernel and has a 'struct page'.

EPC doesn't have a 'struct page', so it won't necessarily be covered by
the pgdat-> and zone-> ranges. I *think* you may have to go all the way
back to the ACPI SRAT for this.
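x86 already has phys_to_target_node(), which consults the SRAT-derived
numa_meminfo (including reserved ranges) rather than pgdat spans.
Something like this might work (untested sketch):

	static int __init sgx_pfn_to_nid(unsigned long pfn)
	{
		int nid = phys_to_target_node(PFN_PHYS(pfn));

		/* keep the patch's fallback behavior: */
		if (nid == NUMA_NO_NODE)
			nid = 0;

		return nid;
	}

Possibly with numa_map_to_online_node() on top, if the resulting node
can be memoryless or offline.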
It would also be *possible* to have an SRAT constructed like this:

	0->1GB  System RAM  - Node 0
	1->2GB  Reserved    - Node 1
	2->3GB  System RAM  - Node 0

where the 1->2GB range is EPC. The Node 0 pg_data_t would be:

	pgdat->node_start_pfn     = 0
	pgdat->node_spanned_pages = 3GB

so the pfn loop above would find EPC pfns inside Node 0's span and
return 0, even though the SRAT says that EPC belongs to Node 1.