From: Wei Liu <wei.liu2@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>,
	"JBeulich@suse.com" <JBeulich@suse.com>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	"ufimtseva@gmail.com" <ufimtseva@gmail.com>,
	Ian Jackson <Ian.Jackson@citrix.com>,
	Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: [PATCH v5 06/24] libxl: introduce vNUMA types
Date: Mon, 16 Feb 2015 16:11:55 +0000
Message-ID: <20150216161155.GF20572@zion.uk.xensource.com>
In-Reply-To: <1424102178.2591.12.camel@citrix.com>

On Mon, Feb 16, 2015 at 03:56:21PM +0000, Dario Faggioli wrote:
> On Mon, 2015-02-16 at 15:17 +0000, Wei Liu wrote:
> > On Mon, Feb 16, 2015 at 02:58:32PM +0000, Dario Faggioli wrote:
> 
> > > > +libxl_vnode_info = Struct("vnode_info", [
> > > > +    ("memkb", MemKB),
> > > > +    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
> > > > +    ("pnode", uint32), # physical node of this node
> > > >
> > > I am unsure whether we ever discussed this (and sorry for not
> > > recalling), but, in principle, one vnode can be mapped to more than
> > > just one pnode.
> > > 
> > 
> > I don't recall either.
> > 
> > > The semantics would be that the memory of the vnode is somehow split
> > > (evenly, by default, I would say) between the specified pnodes. So, pnode
> > > could be a bitmap too (and be called "pnodes" :-) ), although we can put
> > > checks in place so that --for now-- it always has only one bit set.
> > > 
> > > Reasons might be that the user just wants it, or that there is not
> > > enough (free) memory on just one pnode, but we still want to achieve
> > > some locality.
> > > 
> > 
> > Wouldn't this cause unpredictable performance? 
> >
> A certain amount of it, yes, for sure, but always less than having the
> memory striped on all nodes, I would say.
> 
> Well, of course it depends on how it will be used, as usual with these
> things...
> 
> > And there is no way to
> > specify priority among the group of nodes you specify with a single
> > bitmap.
> > 
> Why do we need such a thing as a 'priority'? What I'm talking about is
> making it possible, for each vnode, to specify the vnode-to-pnode mapping
> as a bitmap of pnodes. What we'd do, in the presence of a bitmap, would be
> to allocate the memory by striping it across _all_ the pnodes present in
> the bitmap.
> 

Should we enforce that memory is striped equally across all nodes? If
so, this should be stated explicitly in the interface comment; I can't
see that in your original description. I asked about "priority" because
I interpreted it as something else (which is just one of many possible
interpretations, I think).
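
Just to make the ambiguity concrete, here is a minimal sketch (not
actual libxl code; the uint64_t bitmap representation and all names
are made up for illustration) of one thing "split evenly across the
pnodes in the bitmap" could mean:

#include <stdint.h>

#define MAX_PNODES 64

/* Illustrative only -- not libxl code. */

/* Number of pnodes selected in the bitmap. */
static int nr_set_bits(uint64_t bitmap)
{
    int n = 0;

    for (; bitmap; bitmap >>= 1)
        n += bitmap & 1;
    return n;
}

/*
 * Split a vnode's memory evenly over the pnodes whose bits are set;
 * out[i] receives the share (in KiB) for pnode i.  One arbitrary
 * choice already shows up here: the remainder goes to the first
 * selected pnode.
 */
static void stripe_vnode_memory(uint64_t memkb, uint64_t pnodes,
                                uint64_t out[MAX_PNODES])
{
    int nr = nr_set_bits(pnodes);
    uint64_t share = nr ? memkb / nr : 0;
    uint64_t rem = nr ? memkb % nr : 0;

    for (int i = 0; i < MAX_PNODES; i++) {
        if (pnodes & (1ULL << i)) {
            out[i] = share + rem;
            rem = 0;
        } else {
            out[i] = 0;
        }
    }
}

Even this trivial version has to decide where the remainder goes and
what an empty bitmap means, and those are exactly the details the
interface comment would need to spell out.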

If it's up to libxl to make a dynamic choice, we should say that too.
But that is not very useful to the user, because libxl's algorithm can
change, can't it? How are users expected to know that across versions
of Xen?

> If there's only one bit set, you have the same behavior as in this
> patch.
> 
> > I can't say I fully understand the implications of the scenario you
> > described.
> > 
> Ok. Imagine you want to create a guest with 2 vnodes, 4GB RAM total, so
> 2GB on each vnode. On the host, you have 8 pnodes, but only 1GB free on
> each of them.
> 
> If you can only associate a vnode with a single pnode, there is no node
> that can accommodate a full vnode, so we would have to give up trying
> to place the domain and map the vnodes, and we'd end up with 0.5GB on
> each pnode, unpredictable performance, and, basically, no vNUMA at all
> (or at least no vnode-to-pnode mapping)... Does this make sense?
> 
> If we allow the user (or the automatic placement algorithm) to specify a
> bitmap of pnodes for each vnode, he could put, say, vnode #1 on pnodes #0
> and #2, which maybe are really close (in terms of NUMA distances) to
> each other, and vnode #2 on pnodes #5 and #6 (close to each other too).
> This would give worse performance than having each vnode on just one
> pnode but, most likely, better performance than the scenario described
> right above.
> 

I get what you mean. So by writing the above paragraphs, you sort of
confirm that there are still too many implicit details in the
algorithm, right? A user cannot tell from the interface alone what the
behaviour is going to be. You could of course say the algorithm is
fixed, but I don't think we want to do that, do we?
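
That said, if we do go the bitmap route with the restriction you
mention, the "exactly one bit set, for now" check itself is trivial;
again a sketch with made-up names, assuming a uint64_t bitmap:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only.  Accept the pnodes bitmap only if exactly one
 * bit is set, i.e. the vnode still maps to a single pnode. */
static bool pnodes_is_single(uint64_t pnodes)
{
    /* A non-zero value has exactly one bit set iff clearing its
     * lowest set bit yields zero. */
    return pnodes != 0 && (pnodes & (pnodes - 1)) == 0;
}

The hard part is not the check but documenting when and how the
restriction would later be relaxed.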

Wei.

> Hope I made myself clear enough :-)
> 
> Regards,
> Dario
