Re: [PATCH RFC] x86/sgx: Add trivial NUMA allocation

From: Jarkko Sakkinen <jarkko@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-sgx@vger.kernel.org,
	Sean Christopherson <seanjc@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jiri Kosina <trivial@kernel.org>
Subject: Re: [PATCH RFC] x86/sgx: Add trivial NUMA allocation
Date: Thu, 14 Jan 2021 19:54:50 +0200	[thread overview]
Message-ID: <YACFasdDi0siM2ML@kernel.org> (raw)
In-Reply-To: <34b1acd1-e769-0dc2-a225-8ce3d2b6a085@intel.com>

On Tue, Jan 12, 2021 at 04:24:01PM -0800, Dave Hansen wrote:
> On 12/16/20 5:50 AM, Jarkko Sakkinen wrote:
> > Create a pointer array for each NUMA node with the references to the
> > contained EPC sections. Use this in __sgx_alloc_epc_page() to knock the
> > current NUMA node before the others.
> 
> It makes it harder to comment when I'm not on cc.
> 
> Hint, hint... ;)
> 
> We need a bit more information here as well.  What's the relationship
> between NUMA nodes and sections?  How does the BIOS tell us which NUMA
> nodes a section is in?  Is it the same or different from normal RAM and
> PMEM?

How does it go with pmem?

> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index c519fc5f6948..0da510763c47 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -13,6 +13,13 @@
> >  #include "encl.h"
> >  #include "encls.h"
> >  
> > +struct sgx_numa_node {
> > +	struct sgx_epc_section *sections[SGX_MAX_EPC_SECTIONS];
> > +	int nr_sections;
> > +};
> 
> So, we have a 'NUMA node' structure already: pg_data_t.  Why don't we
> just hang the epc sections off there?
> 
> > +static struct sgx_numa_node sgx_numa_nodes[MAX_NUMNODES];
> 
> Hmm...  Time to see if I can still do math.
> 
> #define SGX_MAX_EPC_SECTIONS            8
> 
> (sizeof(struct sgx_epc_section *) + sizeof(int)) * 8 * MAX_NUMNODES
> 
> CONFIG_NODES_SHIFT=10 (on Fedora)
> #define MAX_NUMNODES (1 << NODES_SHIFT)
> 
> 12*8*1024 = ~100k.  Yikes.  For *EVERY* system that enables SGX,
> regardless if they are NUMA or not.'
> 
> Trivial is great, this may be too trivial.
> 
> Adding a list_head to pg_data_t and sgx_epc_section would add
> SGX_MAX_EPC_SECTIONS*sizeof(list_head)=192 bytes plus 16 bytes per
> *present* NUMA node.

I'm hearing you.

> >  /**
> >   * __sgx_alloc_epc_page() - Allocate an EPC page
> >   *
> > @@ -485,14 +511,19 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
> >   */
> >  struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  {
> > -	struct sgx_epc_section *section;
> >  	struct sgx_epc_page *page;
> > +	int nid = numa_node_id();
> >  	int i;
> >  
> > -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> > -		section = &sgx_epc_sections[i];
> > +	page = __sgx_alloc_epc_page_from_node(nid);
> > +	if (page)
> > +		return page;
> >  
> > -		page = __sgx_alloc_epc_page_from_section(section);
> > +	for (i = 0; i < sgx_nr_numa_nodes; i++) {
> > +		if (i == nid)
> > +			continue;
> 
> Yikes.  That's a horribly inefficient loop.  Consider if nodes 0 and
> 1023 were the only ones with EPC.  What would this loop do?  I think
> it's a much better idea to keep a nodemask_t of nodes that have EPC.
> Then, just do bitmap searches.

OK, so we need to maintain nodemask_t containing nodes with EPC.

Probably also split the patch to a patch that builds nodemask_t,
and the one that uses it. They are separately reviewable clearly.
The allocation part can considered sorted out now.
> 
> > +		page = __sgx_alloc_epc_page_from_node(i);
> >  		if (page)
> >  			return page;
> >  	}
> > @@ -661,11 +692,28 @@ static inline u64 __init sgx_calc_section_metric(u64 low, u64 high)
> >  	       ((high & GENMASK_ULL(19, 0)) << 32);
> >  }
> >  
> > +static int __init sgx_pfn_to_nid(unsigned long pfn)
> > +{
> > +	pg_data_t *pgdat;
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		pgdat = NODE_DATA(nid);
> > +
> > +		if (pfn >= pgdat->node_start_pfn &&
> > +		    pfn < (pgdat->node_start_pfn + pgdat->node_spanned_pages))
> > +			return nid;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> I'm not positive this works.  I *thought* these ->node_start_pfn and
> ->node_spanned_pages are really only guaranteed to cover memory which is
> managed by the kernel and has 'struct page' for it.
> 
> EPC doesn't have a 'struct page', so won't necessarily be covered by the
> pgdat-> and zone-> ranges.  I *think* you may have to go all the way
> back to the ACPI SRAT for this.
> 
> It would also be *possible* to have an SRAT constructed like this:
> 
> 0->1GB System RAM - Node 0
> 1->2GB Reserved   - Node 1
> 2->3GB System RAM - Node 0
> 
> Where the 1->2GB is EPC.  The Node 0 pg_data_t would be:
> 
> 	pgdat->node_start_pfn = 0
> 	pgdat->node_spanned_pages = 3GB

If I've understood the current Linux memory architecture correctly.

- Memory is made available through mm/memory_hotplug.c, which is populated
  by drivers/acpi/acpi_memhotplug.c.
- drivers/acpi/numa/srat.c provides the conversion API from proximity node to
  logical node but I'm not *yet* sure how the interaction goes with memory
  hot plugging

I'm not sure of I'm following the idea of alternative SRAT construciton.
So are you saying that srat.c would somehow group pxm's with EPC to
specific node numbers?

/Jarkko