On Mon, 2015-02-16 at 17:38 +0000, Wei Liu wrote:
> On Mon, Feb 16, 2015 at 04:51:43PM +0000, Dario Faggioli wrote:
> > So, if you're saying that, if we use a bitmap, we should write
> > somewhere how libxl would use it, I certainly agree. Up to what
> > level of detail we should do that, at that point, I'm not sure. I
> > think I'd be fine, as a user, with finding it written that "the
> > memory of the vnode will be allocated out of the pnodes specified
> > in the bitmap", with not much further detail, especially
> > considering the use case for the feature.
> >
> This is of course OK. And the simplest implementation of this
> strategy is to pass the node information on to Xen and let Xen
> decide which of the several specified nodes to allocate from. This
> would be trivial.
>
Exactly. This is, by the way, what happens right now if the user
explicitly sets a hard or a soft affinity for the vcpus that spans
multiple nodes. The domain's node affinity is dynamically calculated
to be, say, 0110, and memory is striped across nodes #1 and #2.

The end result is, most of the time, a (nearly) even distribution
but, for instance, if we run out of free RAM on #1, the allocation
will continue on #2, making things uneven, yet still in line with
what the user asked for. That behaviour would hence be consistent
with the already existing, non-vNUMA case.

> I think having a vnode able to map to several pnodes would be good.
> I'm just trying to figure out if a single bitmap is enough to cover
> all the sensible use cases, and what we should say about that
> interface.
>
Well, ideally, you could turn the pnode map into another 'nested
list', making it possible to specify, for vnode #2, not only that you
want the memory from pnodes #0 and #4, but that you want 0.5G from
the former and 1.5G from the latter. However, this:
 - makes the interface very complicated to specify, understand and
   parse;
 - requires non-trivial changes inside Xen, which are of course
   possible, but are they worth it?
 - brings very few benefits, as you won't have fine-grained enough
   control of what that 0.5G of memory is comprised of (it's all
   within the same vnode!!), nor of what the guest OS will put there,
   e.g., across reboots. So, really, rather useless, IMO.

So, yes, I think having a bitmap would be enough for now (for a
while, actually! :-D)
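To make this concrete, here is a minimal sketch, built on the
existing libxl_bitmap helpers, of how such a per-vnode pnode bitmap
could be constructed. The function name and the very idea of a
"pnodes" bitmap field are illustrative assumptions only (the series,
as it stands, uses a single integer pnode):

    #include <libxl.h>
    #include <libxl_utils.h>

    /* Illustrative only: mark pnodes #0 and #4 as the candidate
     * allocation targets for one vnode.  Xen would then be free to
     * stripe the vnode's memory over whichever of the set pnodes
     * still has free RAM. */
    static int set_vnode_pnodes(libxl_ctx *ctx, libxl_bitmap *pnodes,
                                int nr_pnodes)
    {
        libxl_bitmap_init(pnodes);
        if (libxl_bitmap_alloc(ctx, pnodes, nr_pnodes))
            return -1;
        libxl_bitmap_set(pnodes, 0); /* allow allocation from pnode #0 */
        libxl_bitmap_set(pnodes, 4); /* ... and from pnode #4 */
        return 0;
    }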
> > As I said, if there is only 1GB free on all pnodes, the user will
> > be allowed to specify a set of pnodes for the vnodes, instead of
> > not being able to use vNUMA at all, no matter how libxl (or
> > whoever else) will actually split the memory, in this, previous or
> > future versions of Xen... This is the scenario I'm talking about,
> > and in such a scenario, knowing how the split happens does not
> > really help much; it is just the _possibility_ of splitting that
> > helps...
> >
> I don't see this problem that way, though.
>
> Basically you're saying, a user wants to use vNUMA, then at some
> point he / she finds out there is not enough memory in each
> specified pnode to accommodate his / her requirement, then he / she
> changes the configuration on the fly.
>
Mmm... no, I wasn't thinking about 'on the fly' changes. But perhaps
I don't get what you mean by 'on the fly'.

> In reality, if you have mass deployment you probably won't do that.
> You might just want to use the same configuration all the time.
>
Yes, and given you have the same VMs every time, and especially if
deployment also happens in the same (or a very similar) order every
time, you'll *always* end up with only 1G left free on all pnodes,
and hence you will *never* be able to turn on vnode-to-pnode mapping
for that last VM you are deploying, the one with 2G vnodes.

> > > > If we allow the user (or the automatic placement algorithm) to
> > > > specify a bitmap of pnodes for each vnode, he could put, say,
> > > > vnode #1 on pnodes #0 and #2, which maybe are really close (in
> > > > terms of NUMA distances) to each other, and vnode #2 on pnodes
> > > > #5 and #6 (close to each other too). This would give worse
> > > > performance than having each vnode on just one pnode but, most
> > > > likely, better performance than the scenario described right
> > > > above.
> > > >
> > > I get what you mean. So by writing the above paragraphs, you
> > > sort of confirm that there still are too many implications in
> > > the algorithms, right? A user cannot just tell from the
> > > interface what the behaviour is going to be.
> > >
> > A user can tell that, if he wants a vnode 2GB wide, and there is
> > no pnode with 2GB free, but the sum of the free memory in pnodes
> > #4 and #6 is >= 2GB, he can still use vNUMA, by paying the (small
> > or high, depending on more factors) price of having that vnode
> > split in two (or more!).
> >
> What if #4 and #6 do have > 2GB of RAM each? What will the
> behaviour be?
>
Xen decides, while allocating, as it does right now.

> Does it imply better or worse performance? Again, I'm thinking
> about migrating the same configuration to another version of Xen,
> or even just to another host that has enough memory.
>
> I guess the best we can say (at this point, if we're to use a
> bitmap) is that memory will be allocated from the nodes specified,
> and the user should not expect any specific behaviour -- that
> basically is telling the user not to specify multiple nodes...
>
So, vNUMA is a performance optimization. Actually, no: vNUMA is a
feature someone may want independently from performance, but mapping
vNUMA nodes to pNUMA nodes is *definitely* a performance
optimization.

In my experience, performance depends on a lot of things and factors.
There are very few features that just "give better or worse
performance". There are some, sure, but I don't think vNUMA falls in
this category. For instance, in this case, performance will depend on
the workload, on the host load conditions, on the order in which you
boot the VMs, and on other constraints (e.g., whether or not you use
devices attached to different IO controllers on IONUMA boxes), and
that is the case already.

We should work hard not to cause performance regressions, i.e., not
to make things worse for people just upgrading Xen and not changing
anything else; but, in this case, not changing *anything* *else*
means taking into account all the factors above (or more!). Then yes,
if one really manages to keep all the involved variables and factors
fixed, just upgrading Xen should not degrade performance. Actually,
it would ideally make things better... isn't that what we're supposed
to do here all day? :-P :-P

IOW, in this case, if one wants maximum determinism, he can just set
_only_ one bit in the bitmap and forget about it. OTOH, if one can
trade a slight (potential) degradation in performance (which of
course also means less determinism) for increased flexibility, the
bitmap would help.
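And, to illustrate the "only one bit" end of that trade-off, here is
a rough sketch, with a made-up function name, of how a caller like xl
could enforce a weight of 1 on the bitmap while the libxl API itself
keeps carrying a bitmap. Again, this is only an assumption of how it
might look, not code from the series:

    #include <stdio.h>
    #include <libxl.h>
    #include <libxl_utils.h>

    /* Illustrative only: if more than one pnode is set for a vnode,
     * keep just the first set bit and warn, i.e., enforce the rigid
     * one-pnode semantics at the xl level while the API stays a
     * bitmap. */
    static void enforce_single_pnode(libxl_bitmap *pnodes)
    {
        int bit, first = -1;

        if (libxl_bitmap_count_set(pnodes) <= 1)
            return;

        libxl_for_each_set_bit(bit, *pnodes) {
            if (first < 0)
                first = bit;                     /* keep the first pnode */
            else
                libxl_bitmap_reset(pnodes, bit); /* drop the others */
        }
        fprintf(stderr,
                "WARNING: multiple pnodes specified, using only #%d\n",
                first);
    }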
Allow me another example: migration, since you mention it yourself.

Assume we have vnode #0 mapped on pnode #2 and vnode #1 mapped on
pnode #4 on host X, with the distance between these pnodes being 20.
We migrate the VM to host Y, and there is not enough free memory on
any couple of pnodes at distance 20, so we pick two at distance 40:
this will alter the performance.

What if there is free memory in two such pnodes, because they're
bigger on Y than on X, but there happens to be more contention on the
pcpus of those nodes on Y than on X (as a consequence of them being
bigger, or because on Y there are more domains that are small
memory-wise but have more vcpus, or ...)? This will alter the
performance (making things worse). What if there is less contention?
This will alter the performance (making things better). What if there
are not even two pnodes with enough free memory, no matter the
distance, and we need to turn vnode-to-pnode mapping off? This will
alter the performance. What if the new host is non-NUMA? This will
alter the performance.

So, in summary: vnode-to-pnode mapping is helping performance for
your workload? Jolly good. You migrate the VM(s)? All bets are off!
You upgrade Xen? Well, if you don't change anything else, things
should be equal or better; if you change something else, there is the
chance that you will need to re-evaluate the performance and adapt
the workload... but that's the case already, no matter whether a
bitmap or an integer is used.

> > I think there would be room for some increased user satisfaction
> > in this, even without knowing much and/or being in control of how
> > exactly the split happens, as there are chances for performance to
> > be (if the thing is used properly) better than in the no-vNUMA
> > case, which is what we're after.
> >
> > > You can of course say the algorithm is fixed but I don't think
> > > we want to do that?
> > >
> > I don't want to, but I don't think it's needed.
> >
> > Anyway, I'm more than OK if we want to defer the discussion to
> > after this series is in. It will require a further change in the
> > interface, but I don't think that would be a terrible price to
> > pay, if we decide the feature is worth it.
> >
> > Or, and that was the other thing I was suggesting, we can have the
> > bitmap in vnode_info from now on, but then only accept ints in the
> > xl config parsing, and enforce the weight of the bitmap to be 1
> > (and perhaps print a warning) for now. This would not require
> > changing the API in future; it would just be a matter of changing
> > the xl config file parsing. The "problem" would still stand for
> > libxl callers other than xl, though, I know.
> >
> Note that the uint32_t mapping has a very rigid semantics.
>
It has. Whether or not this rigidity buys you more consistent
performance, for example across migrations, is questionable though,
and my opinion is that no, it does not.

> As long as you give me a well-defined semantics of that bitmap I'm
> fine with this. Otherwise I feel more comfortable with the
> interface as it is.
>
The well-defined semantics is: <>

BTW, sorry for being so long... I at least hope I've expressed my
view clearly enough. Let me also repeat that I'm fine leaving this
alone and (perhaps) coming back to it later, when the series is
merged.

Thanks and Regards,
Dario