From: Wei Liu <wei.liu2@citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>,
	JBeulich@suse.com, andrew.cooper3@citrix.com,
	dario.faggioli@citrix.com, ian.jackson@eu.citrix.com,
	xen-devel@lists.xen.org, ufimtseva@gmail.com
Subject: Re: [PATCH v4 21/21] xl: vNUMA support
Date: Thu, 29 Jan 2015 17:46:51 +0000	[thread overview]
Message-ID: <20150129174651.GI20229@zion.uk.xensource.com> (raw)
In-Reply-To: <1422529839.30641.42.camel@citrix.com>

On Thu, Jan 29, 2015 at 11:10:39AM +0000, Ian Campbell wrote:
> On Wed, 2015-01-28 at 22:52 +0000, Wei Liu wrote:
> > > guests, is preballooning allowed there too?
> > 
> > I need to check the PV boot sequence to have a definite answer.
> > 
> > Currently, memory allocation in libxc only deals with a single chunk
> > of contiguous memory. I'm not sure if changing that would break some
> > assumptions the guest kernel makes.
> 
> Please do check, and if it doesn't work today we really ought to have a
> plan for how to integrate it in the future, in case (as seems likely) it
> requires cooperation between the tools and the kernel -- so we can think
> about steps now to make it easier on ourselves later...
> 

I have only looked at the Linux kernel, so this is very Linux-centric --
though I wonder if there are any other PV kernels in the wild.

Libxc allocates a contiguous chunk of memory, and the guest kernel then
remaps memory that falls inside a non-RAM region. (This leads me to
think I need to rewrite the patch that allocates memory to also take
memory holes into account, but that is another matter.)

Speaking of pre-ballooned PV guests: in theory, if we still allocate
memory in one contiguous chunk, it should work. But something more
complex, like partially populating multiple vnodes, might not, because
code in the Linux kernel assumes that the pre-ballooned pages are
appended to the end of populated memory.
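
To make that concrete, here is a minimal sketch of what "one contiguous
chunk with a pre-ballooned tail" means. This is not the actual libxc
code; the function and parameter names are made up for illustration,
though xc_domain_populate_physmap_exact() is the real libxc call:

  /* Sketch only: populate a PV guest's memory as one contiguous run
   * of PFNs, leaving the pre-ballooned pages as an unpopulated tail,
   * which is the layout the Linux balloon driver expects. */
  #include <assert.h>
  #include <stdlib.h>
  #include <xenctrl.h>

  static int populate_contiguous(xc_interface *xch, uint32_t domid,
                                 unsigned long nr_total,    /* maxmem pages */
                                 unsigned long nr_populated /* memory pages */)
  {
      xen_pfn_t *pfns = malloc(nr_populated * sizeof(*pfns));
      unsigned long i;
      int rc;

      if (!pfns)
          return -1;
      assert(nr_populated <= nr_total);

      /* PFNs 0 .. nr_populated-1 are backed by real memory. */
      for (i = 0; i < nr_populated; i++)
          pfns[i] = i;

      rc = xc_domain_populate_physmap_exact(xch, domid, nr_populated,
                                            0 /* order */, 0 /* flags */,
                                            pfns);

      /* PFNs nr_populated .. nr_total-1 stay unpopulated: this is the
       * pre-ballooned region at the end of memory that the balloon
       * driver can populate later. */
      free(pfns);
      return rc;
  }

Partially populating multiple vnodes would break the "single run of
PFNs ending at the balloon" picture above, which is why I expect it
needs kernel cooperation.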

> > > > +=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
> > > > +
> > > > +Specifiy which physical NUMA node a specific virtual NUMA node maps to. The
> > > 
> > > "Specify" again.
> > > 
> > > > +number of elements in this list should be equal to the number of virtual
> > > > +NUMA nodes defined in B<vnuma_memory=>.
> > > 
> > > Would it make sense to instead have a single array of e.g. "NODE:SIZE"
> > > or something?
> > > 
> > 
> > Or "PNODE:SIZE:VCPUS"?
> 
> That seems plausible.
> 
> One concern would be future expansion, perhaps foo=bar,baz=wibble?
> 

I'm fine with that. We can use a nested list:

vnuma = [ [node=0,size=1024,vcpus=...], [...] ]
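
Purely as illustration, a two-vnode guest under that proposed syntax
might look like the following (the key names and vcpu ranges are
placeholders for discussion, not a settled interface):

  # hypothetical: 2 vnodes, 1024M each, 2 vcpus per node
  vnuma = [ [node=0, size=1024, vcpus=0-1],
            [node=1, size=1024, vcpus=2-3] ]

A nested list also leaves room for the future expansion you mention,
since new key=value pairs can be added per node without breaking the
syntax.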

> > > > +=item B<vnuma_vdistance=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
> > > > +
> > > > +Two-dimensional list to specify distances among nodes.
> > > > +
> > > > +The number of elements in the first dimension list equals the number of virtual
> > > > +nodes. Each element in position B<i> is a list that specifies the distances
> > > > +from node B<i> to other nodes.
> > > > +
> > > > +For example, for a guest with 2 virtual nodes, the user can specify:
> > > > +
> > > > +  vnuma_vdistance = [ [10, 20], [20, 10] ]
> > > 
> > > Any guidance on how a user should choose these numbers?
> > > 
> > 
> > I think using the numbers from numactl is good enough.
> 
> Worth mentioning in the docs I think.
> 
> > > Do we support a mode where something figures this out based on the
> > > underlying distances between the pnodes to which the vnodes are assigned?
> > > 
> > 
> > Dario is working on that.
> 
> I thought he was working on automatic numa placement (i.e. figuring out
> the best set of pnodes to map the vnodes to), whereas what I was
> suggesting was that given the user has specified a vnode->pnode mapping
> it should be possible to construct a distances table pretty trivially
> from that. Is Dario working on that too?
> 

Right. That's trivial inside libxl.

What I meant was that Dario is about to touch all this automation code
anyway, so it might be trivial for him to just do it all in one go. Of
course, if he has not done that, I can add the logic myself.
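
For the record, the trivial construction I have in mind is just a
lookup into the host distance table, along these lines (sketch only;
the names are made up, not libxl's actual interface):

  /* Sketch only: derive a guest distance table from a vnode -> pnode
   * map and the host's physical distance matrix (row-major). */
  static void derive_vdistance(const unsigned *vnode_to_pnode,
                               unsigned nr_vnodes,
                               const unsigned *pdistance, /* nr_pnodes x nr_pnodes */
                               unsigned nr_pnodes,
                               unsigned *vdistance /* out: nr_vnodes x nr_vnodes */)
  {
      unsigned i, j;

      for (i = 0; i < nr_vnodes; i++)
          for (j = 0; j < nr_vnodes; j++)
              vdistance[i * nr_vnodes + j] =
                  pdistance[vnode_to_pnode[i] * nr_pnodes +
                            vnode_to_pnode[j]];
  }

For a host with distances [[10, 20], [20, 10]] and an identity
vnode->pnode map, this yields exactly the vnuma_vdistance example
quoted above.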

Wei.

> Ian.
