From: Dario Faggioli <raistlin@linux.it>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Date: Wed, 02 May 2012 18:30:51 +0200
Message-ID: <1335976251.2961.123.camel@Abyss>
In-Reply-To: <4FA0051F.6020909@eu.citrix.com>



On Tue, 2012-05-01 at 16:45 +0100, George Dunlap wrote: 
> Hey Dario,
> 
Hi again,

> Thanks for the work on this -- this is obviously a very complicated area 
> to try to make sense of.  Of course, that makes it even more complicated 
> to review -- sorry this has taken so long.  (comments inline)
>
Again, no problem at all!

> > +
> > +=item B<auto>
> > +
> > +use a not better specified (xl-implementation dependant) algorithm
> > +to try automatically fitting the guest on the host's NUMA nodes
> Uum, you mean, "Use the default placement algorithm"? :-)  We should 
> specify which one this will be here.
> > +
> > +/* Automatic placement policies symbols, with the following meaning:
> > + *  NONE   : no automatic placement at all;
> > + *  STATIC : explicit nodes specification.
> > + *  FFIT   : use the First Fit algorithm for automatic placement;
> > + *  AUTO   : an not better specified automatic placement, likely
> > + *           to be implemented as a short circuit to (one of)
> > + *           the above(s);
> > + */
> > +#define NODES_POLICY_NONE    0
> > +#define NODES_POLICY_STATIC  1
> > +#define NODES_POLICY_FFIT    2
> > +#define NODES_POLICY_AUTO    3
> This is minor, but it seems like "auto" should be 1, so if we add 
> another policy, the policies are all together, without having to move 
> AUTO around every time.
>
Ok, I'll reorganize this bit to be more sensible, which also includes
changing the names of the policies and some other cleanups.

> > +
> > +/* Store the policy for the domain while parsing */
> > +static int nodes_policy = NODES_POLICY_DEFAULT;
> > +
> > +/* Store the number of nodes to be used while parsing */
> > +static int num_nodes_policy = 0;
> Why are "nodes_policy" and "num_nodes_policy" not passed in along with 
> b_info?
>
That was my first implementation. Then I figured out that I want to do
the placement in _xl_, not in _libxl_, so I really don't need to muck up
build info with placement-related stuff. Should I use b_info anyway,
even if I don't need these fields while in libxl?

> > +static int nodes_policy_parse(const char *pol, long int *policy)
> > +{
> > +    int i;
> > +    const char *n;
> > +
> > +    for (i = 0; i<  sizeof(nodes_policy_names) / sizeof(nodes_policy_names[0]); i++) {
> I personally prefer an explicit NODES_POLICY_MAX, but I'll let the libxl 
> maintainers decide on that.
>
That definitely sounds nicer. I just did it like that because I found a very
similar example in xl itself, but I'm open to changing this to
whatever you and the libxl maintainers reach a consensus on. :-)
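
To make it concrete, something like the below is what I have in mind --
just a sketch covering both the reordering above and the explicit maximum;
the names and string values are placeholders, not necessarily what will end
up in the patch:

/* Rough sketch only: keep AUTO right after NONE so that new policies can
 * be appended at the end, and use an explicit maximum instead of the
 * sizeof(array)/sizeof(array[0]) dance in the parser. */
enum nodes_policy {
    NODES_POLICY_NONE = 0,
    NODES_POLICY_AUTO,
    NODES_POLICY_STATIC,
    NODES_POLICY_FFIT,
    NODES_POLICY_MAX        /* always keep this last */
};

static const char *const nodes_policy_names[NODES_POLICY_MAX] = {
    [NODES_POLICY_NONE]   = "none",
    [NODES_POLICY_AUTO]   = "auto",
    [NODES_POLICY_STATIC] = "static",
    [NODES_POLICY_FFIT]   = "ffit",
};

/* ... so the parse loop simply becomes:
 *     for (i = 0; i < NODES_POLICY_MAX; i++)
 */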

> > +
> > +        /* Determine how much free memory we want on each of the nodes
> > +         * that will end up in suitable_nodes. Either adding or ignoring
> > +         * the modulus of the integer division should be fine (the total
> > +         * number of nodes should be in the order of tens of them), so
> > +         * let's add it as it should be more safe.
> > +         */
> > +        memb_node = (memb / (*nodes)) + (memb % (*nodes));
> Wouldn't it just be easier to do a "round-up" function here, like this:
>   memb_node = ( (memb + *nodes -1) / (*nodes) )
> 
Yep, it probably is, thanks. :-)
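
Just so it is written down somewhere, I'd turn it into a small helper
(assuming memb and *nodes are positive and that the sum cannot overflow):

#include <stdint.h>

/* Round-up integer division, as suggested above. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* ... so that: memb_node = div_round_up(memb, *nodes); */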

> Also, is it really necessary for a VM to have an equal amount of memory 
> on every node? It seems like it would be better to have 75% on one node 
> and 25% on a second node than to have 25% on four nodes, for example.  
> Furthermore, insisting on an even amount fragments the node memory 
> further -- i.e., if we chose to have 25% on four nodes instead of 75% on 
> one and 25% on another, that will make it harder for another VM to fit 
> on a single node as well.
> 
Ok, that is something quite important to discuss. What you propose makes
a lot of sense, although some issues come to mind:

- which percentages should I try, and in what order? I mean, 75%/25%
   sounds reasonable, but maybe 80%/20% or even 60%/40% would also serve
   your point. So, should I just pick an asymmetrical layout and try it
   (instead of 50%/50%), or do you have some sort of "trial and error"
   approach in mind, with different layouts being attempted one after the
   other? The problem is very complex, and finding a (good) solution
   will easily end up depending on which layout we try first, in which
   order we try the other ones (if at all) and on the free memory we
   have on the various nodes... We could probably state it as an integer
   optimization problem, but I'm not sure we want to put a GLPK (GNU
   Linear Programming Kit) solver inside xl! :-O

- suppose I go for 75%/25%, what about the scheduling of the VM?
   Right now, with 50%/50%, I can set node affinity to both nodes and
   rest assured things won't be that bad. I'll be getting something
   in between the best and worst case performance, as the
   benchmarks show. Doing the same with 75%/25% means I could be
   instructing the scheduler to run the VM on a node where it has a far
   lower probability of accessing local memory than on the other one. OTOH,
   if I set node affinity to just the node with 75% of the memory, I
   risk always generating 25% remote accesses, which will keep us
   away from the best-case performance, won't it?

- what if the asymmetrical layout of choice (let's stick to 75%/25%)
   does not fit anywhere and I really need one more (or one fewer)
   node... I mean, how do I partition memory in that case?

Please don't get me wrong, I see your point and really think it makes
sense. I've actually thought along the same lines for a while, but then I
couldn't find answers to the questions above.

That's why I kind of fell back to Xen's default "striped" approach
(although on as few nodes as possible, which is _much_ better than
Xen's actual default!). It looked simple enough to write, read and
understand, while still providing statistically consistent performance.

If we manage to sort out these issues together, I'm all for
asymmetrical placement myself! :-D
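
Just to make the kind of layout you're describing a bit more concrete, the
most natural thing I can think of is a greedy fit: put as much of the guest
as possible on the node with the most free memory, then spill the remainder
onto the next freest one, and so on. A purely illustrative, completely
untested sketch (the structure and names are made up for the example, they
are not libxl types):

#include <stdint.h>
#include <stdlib.h>

struct node_free {
    int node;
    uint64_t free_kb;
};

/* Sort nodes by free memory, descending. */
static int cmp_free_desc(const void *a, const void *b)
{
    const struct node_free *x = a, *y = b;
    return (y->free_kb > x->free_kb) - (y->free_kb < x->free_kb);
}

/*
 * Greedily spread memkb over the nodes, taking as much as possible from
 * the freest node first.  Fills alloc_kb[node] with the amount taken from
 * each node (the caller is expected to have zeroed alloc_kb) and returns
 * the number of nodes used, or -1 if the guest does not fit at all.
 */
static int greedy_fit(struct node_free *nodes, int nr_nodes,
                      uint64_t memkb, uint64_t *alloc_kb)
{
    int i, used = 0;

    qsort(nodes, nr_nodes, sizeof(*nodes), cmp_free_desc);
    for (i = 0; i < nr_nodes && memkb; i++) {
        uint64_t take = nodes[i].free_kb < memkb ? nodes[i].free_kb : memkb;

        if (!take)
            break;
        alloc_kb[nodes[i].node] = take;
        memkb -= take;
        used++;
    }
    return memkb ? -1 : used;
}

That would naturally produce splits like 75%/25% when they fit, but it
leaves the scheduling question from the second bullet above wide open.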

> The need for NODES_POLICY_RETRY_DEC is an artifact of the above; if 
> nodes were allowed to be assymetric, you wouldn't need to *decrease* the 
> number of nodes to find *more* memory.
>
I agree; let's try to figure out what we think the heuristics should look
like, ok? Once that's done, I'll be happy to kill RETRY_DEC if it turns
out to be useless! :-P

> > +        /* Apply the asked retry policy for the next round. Notice
> > +         * that it would be pointless to increase nodes without also
> > +         * adding some nodes to the map! */
> > +        *nodes += retry;
> > +        if (retry == NODES_POLICY_RETRY_INC)
> > +            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
> Hmm -- if I'm reading this right, the only time the nodemap won't be all 
> nodes is if (1) the user specified nodes, or (2) there's a cpumask in 
> effect.  If we're going to override that setting, wouldn't it make sense 
> to just expand to all numa nodes?
> 
As you wish; the whole question of "what to do if what I've been provided
with doesn't work" is still at the *wild guess* stage, meaning I tried to
figure out what would be best to do, but I might well be far from the actual
correct solution, provided there is one.

Enlarging the nodemap step by step potentially yields better
performance, but it is probably not that close to the "least surprise"
principle one should follow when designing UIs. :-(

> Hmm -- though I suppose what you'd really want to try is adding each 
> node in turn, rather than one at a time (i.e., if the cpus are pinned to 
> nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before 
> trying [1,2,3,4].  
>
Yep, that makes a lot of sense, thanks! I can definitely try doing
that, although it will complicate the code a bit...
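
Something along these lines, perhaps -- just a sketch using a plain bitmask
instead of libxl_nodemap, with a hypothetical try_fit() callback standing in
for the actual "does the domain fit here?" attempt, and assuming fewer than
64 nodes:

#include <stdint.h>

/*
 * 'initial' is the nodemap derived from the user's pinning/config
 * (e.g. nodes {2,3}); try_fit() returns 0 on success.  Try the initial
 * map first, then each currently excluded node added one at a time
 * (e.g. {1,2,3} and {2,3,4}), and only then all the nodes together.
 */
static int place_expanding_nodemap(uint64_t initial, int nr_nodes,
                                   int (*try_fit)(uint64_t nodemask))
{
    int n;

    if (try_fit(initial) == 0)
        return 0;

    for (n = 0; n < nr_nodes; n++) {
        if (initial & (1ULL << n))
            continue;
        if (try_fit(initial | (1ULL << n)) == 0)
            return 0;
    }

    return try_fit((1ULL << nr_nodes) - 1);
}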

> But that's starting to get really complicated -- I 
> wonder if it's better to just fail and let the user change the pinnings 
> / configuration node mapping.
>
Well, that will probably be the least surprising behaviour.

Again, just let me know which of the various alternatives you think is
best and I'll go for it.

> > +
> > +        if (use_cpus>= b_info->max_vcpus) {
> > +            rc = 0;
> > +            break;
> > +        }
> Hmm -- there's got to be a better way to find out the minimum number of 
> nodes to house a given number of vcpus than just starting at 1 and 
> re-trying until we have enough.
> > +        /* Add one more node and retry fitting the domain */
> > +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
> Same comment as above.
>
I'm not sure I'm getting this. The whole point here is to consider
free memory on the various nodes first, and then adjust the result if
some other constraint is being violated.

However, if what you mean is that I could check beforehand whether or not
the user-provided configuration will give us enough CPUs, and avoid testing
scenarios that are guaranteed to fail, then I agree and I'll reshape the
code accordingly. This brings up the heuristics-redesign question from above
again, as one has to decide what to do if the user asks for
"nodes=[1,3]" and I discover (earlier) that I need one more node to
have enough CPUs (I mean, which node should I try first?).
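
For the CPU side, a rough lower bound could indeed be pre-computed instead
of retrying from one node upwards; just a sketch, where cpus_per_node would
really have to come from the topology information (and is only an
approximation on asymmetric hosts):

/* Rough lower bound on the number of nodes needed to give the guest
 * max_vcpus pCPUs, assuming roughly cpus_per_node CPUs per node. */
static int min_nodes_for_vcpus(int max_vcpus, int cpus_per_node)
{
    if (cpus_per_node <= 0)
        return 1;
    return (max_vcpus + cpus_per_node - 1) / cpus_per_node;
}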

> >
> > +    ret = place_domain(&d_config.b_info);
> > +    if (ret == ESRCH || ret == ENOSPC) {
> > +        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
> > +        libxl_nodemap_set_any(&d_config.b_info.nodemap);
> Hmm -- is this the right fallback?  I think in general, if the user asks 
> for X to be done, and X cannot be done for whatever reason, the right 
> thing is to tell the user that X cannot be done and let them sort it 
> out, rather than resorting to Y (unless that's been set).
> 
I agree and I will go for this.

> Is it the case that if any node placement fails, then they'll all fail?  
> Or to put it the other way, is it the case that if there is *some* 
> placement, that any placement algorithm will eventually find it?  If so, 
> then I think this fallback may make sense, as there's nothing for the 
> user to really do.
>
Well, that's tricky. I mean, if we were only talking about fitting all
the VM's memory on one single node, then yes, no matter how you order
the list of nodes, if there is at least one with enough space for the
VM, you'll find it. However, we might be dealing with "fit the VM on
multiple nodes" scenarios, which is a completely different thing, and
changing the number of nodes across which you're trying to fit the VM
changes pretty much everything.

So, I'm not entirely sure I answered your question, but the point is that
your idea above is the best one: if you ask for something and we don't
manage to get it done, we just stop and let you figure things out.
I have only one question about this approach: what if automatic
placement is/becomes the default? I mean, avoiding any kind of fallback
(which, again, makes sense to me when the user is explicitly asking for
something specific) would mean a completely NUMA-unaware VM creation can
be aborted even though the user did not say anything... How do we deal
with this?

> > diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > ...
> > 
> This should be in its own patch.
> 
Ok.

Thanks a lot again for taking a look!

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


