From: Dario Faggioli
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Date: Wed, 02 May 2012 18:30:51 +0200
To: George Dunlap
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich

On Tue, 2012-05-01 at 16:45 +0100, George Dunlap wrote:
> Hey Dario,
>
Hi again,

> Thanks for the work on this -- this is obviously a very complicated area
> to try to make sense of. Of course, that makes it even more complicated
> to review -- sorry this has taken so long. (comments inline)
>
Again, no problem at all!

> > +
> > +=item B
> > +
> > +use a not better specified (xl-implementation dependent) algorithm
> > +to try automatically fitting the guest on the host's NUMA nodes
>
> Uum, you mean, "Use the default placement algorithm"? :-) We should
> specify which one this will be here.

> > +
> > +/* Automatic placement policies symbols, with the following meaning:
> > + *  NONE   : no automatic placement at all;
> > + *  STATIC : explicit nodes specification;
> > + *  FFIT   : use the First Fit algorithm for automatic placement;
> > + *  AUTO   : a not better specified automatic placement, likely
> > + *           to be implemented as a short circuit to (one of)
> > + *           the above(s);
> > + */
> > +#define NODES_POLICY_NONE   0
> > +#define NODES_POLICY_STATIC 1
> > +#define NODES_POLICY_FFIT   2
> > +#define NODES_POLICY_AUTO   3
>
> This is minor, but it seems like "auto" should be 1, so if we add
> another policy, the policies are all together, without having to move
> AUTO around every time.
>
Ok, I'll reorganize this bit to be more sensible, which also includes
changing the names of the policies and some other cleanups.

> > +
> > +/* Store the policy for the domain while parsing */
> > +static int nodes_policy = NODES_POLICY_DEFAULT;
> > +
> > +/* Store the number of nodes to be used while parsing */
> > +static int num_nodes_policy = 0;
>
> Why are "nodes_policy" and "num_nodes_policy" not passed in along with
> b_info?
>
That was my first implementation. Then I figured out that I want to do
the placement in _xl_, not in _libxl_, so I really don't need to muck up
build info with placement-related stuff. Should I use b_info anyway,
even if I don't need these fields while in libxl?

> > +static int nodes_policy_parse(const char *pol, long int *policy)
> > +{
> > +    int i;
> > +    const char *n;
> > +
> > +    for (i = 0; i < sizeof(nodes_policy_names) / sizeof(nodes_policy_names[0]); i++) {
>
> I personally prefer an explicit NODES_POLICY_MAX, but I'll let the libxl
> maintainers decide on that.
>
That definitely sounds nicer. I just did it like that because I found a
very similar example in xl itself, but I'm open to changing this to
whatever you and the libxl maintainers reach a consensus on. :-)
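Just to make the NODES_POLICY_MAX variant concrete, something like the
sketch below is what I have in mind (a rough sketch only: the policy
name strings, the return values and the exact ordering are made up for
the example, they are not what the patch currently does):

#include <string.h>
#include <errno.h>

/* Keeping AUTO low, as suggested, so new policies can simply be
 * appended at the end without moving it around. */
#define NODES_POLICY_NONE   0
#define NODES_POLICY_AUTO   1
#define NODES_POLICY_STATIC 2
#define NODES_POLICY_FFIT   3
#define NODES_POLICY_MAX    4

/* Hypothetical policy names, purely for the sake of the example */
static const char *const nodes_policy_names[NODES_POLICY_MAX] = {
    [NODES_POLICY_NONE]   = "none",
    [NODES_POLICY_AUTO]   = "auto",
    [NODES_POLICY_STATIC] = "static",
    [NODES_POLICY_FFIT]   = "ffit",
};

static int nodes_policy_parse(const char *pol, long int *policy)
{
    int i;

    /* Bounded by the explicit maximum rather than sizeof() arithmetic */
    for (i = 0; i < NODES_POLICY_MAX; i++) {
        if (!strcmp(pol, nodes_policy_names[i])) {
            *policy = i;
            return 0;
        }
    }
    return EINVAL;
}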
> > +
> > +    /* Determine how much free memory we want on each of the nodes
> > +     * that will end up in suitable_nodes. Either adding or ignoring
> > +     * the modulus of the integer division should be fine (the total
> > +     * number of nodes should be in the order of tens of them), so
> > +     * let's add it as it should be more safe.
> > +     */
> > +    memb_node = (memb / (*nodes)) + (memb % (*nodes));
>
> Wouldn't it just be easier to do a "round-up" function here, like this:
>   memb_node = ( (memb + *nodes - 1) / (*nodes) )
>
Yep, it probably is, thanks. :-)

> Also, is it really necessary for a VM to have an equal amount of memory
> on every node? It seems like it would be better to have 75% on one node
> and 25% on a second node than to have 25% on four nodes, for example.
> Furthermore, insisting on an even amount fragments the node memory
> further -- i.e., if we chose to have 25% on four nodes instead of 75% on
> one and 25% on another, that will make it harder for another VM to fit
> on a single node as well.
>
Ok, that is something quite important to discuss. What you propose makes
a lot of sense, although a few issues come to my mind:

 - which percentages should I try, and in what order? I mean, 75%/25%
   sounds reasonable, but maybe 80%/20% or even 60%/40% would also serve
   your point. So, should I just pick one asymmetrical layout and try it
   (instead of 50%/50%), or do you have some sort of "trial and error"
   approach in mind, with different layouts being attempted one after
   the other? The problem is very complex, and finding a (good) solution
   will easily start depending on which layout we try first, in which
   order we try the other ones (if at all) and on the free memory we
   have on the various nodes... We could probably state it as an integer
   optimization problem, but I'm not sure we want to put a GLPK (GNU
   Linear Programming Kit) solver inside xl! :-O

 - suppose I go for 75%/25%, what about the scheduling of the VM?
   Right now, with 50%/50%, I can set node affinity to both nodes and
   rest assured things won't be that bad: I'll be getting something in
   the middle between best and worst performance, as the benchmarks
   show. Doing the same with 75%/25% means I may be instructing the
   scheduler to run the VM on a node where it has a much lower chance
   of accessing local memory than on the other one. OTOH, if I set node
   affinity to just the node with 75% of the memory, I risk always
   generating 25% remote accesses, which will keep us apart from the
   best case performance, won't it?

 - what if the asymmetrical layout of choice (let's stick to 75%/25%)
   does not fit anywhere and I really need one more (or one less)
   node... I mean, how do I partition memory in that case?

Please, don't get me wrong, I see your point and really think it makes
sense. I've actually thought along the same lines for a while, but then
I couldn't find answers to the questions above. That's why I kind of
fell back to Xen's default "striped" approach (although on as few nodes
as possible, which is _much_ better than Xen's actual default!). It
looked simple enough to write, read and understand, while still
providing statistically consistent performance. If we, together, manage
to sort out these issues, I'm all for asymmetrical placement myself!
:-D
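Just to make the comparison concrete, here is a toy sketch of the two
splitting strategies (plain C with made-up names, not code from the
patch): the even per-node share using your round-up division, and a
greedy asymmetrical split that fills each candidate node in turn and
spills the remainder over to the next one:

#include <stdint.h>
#include <string.h>

/* Even split: every node in the candidate set gets the same share,
 * computed with the round-up division suggested for the hunk above. */
static uint64_t even_share(uint64_t memb, int nodes)
{
    return (memb + nodes - 1) / nodes;
}

/* Greedy asymmetrical split: give the first node as much of the VM's
 * memory as it can hold, spill the rest over to the next one, and so
 * on.  free_mem[] is the per-node free memory, assumed to be already
 * sorted by decreasing size; share[] receives the per-node share.
 * Returns the number of nodes actually used, or -1 if the VM does not
 * fit in nr_nodes nodes at all. */
static int greedy_split(uint64_t memb, const uint64_t *free_mem,
                        int nr_nodes, uint64_t *share)
{
    int i;

    memset(share, 0, nr_nodes * sizeof(*share));
    for (i = 0; i < nr_nodes && memb > 0; i++) {
        share[i] = memb < free_mem[i] ? memb : free_mem[i];
        memb -= share[i];
    }
    return memb > 0 ? -1 : i;
}

The greedy variant does exactly what you describe: it keeps the VM on
as few nodes as possible and leaves the other nodes less fragmented, at
the price of the scheduling questions raised in the second bullet above.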
> The need for NODES_POLICY_RETRY_DEC is an artifact of the above; if
> nodes were allowed to be asymmetric, you wouldn't need to *decrease* the
> number of nodes to find *more* memory.
>
I agree; let's try to figure out together how the heuristics should
look, ok? That being done, I'll be happy to kill RETRY_DEC if it turns
out to be useless! :-P

> > +        /* Apply the asked retry policy for the next round. Notice
> > +         * that it would be pointless to increase nodes without also
> > +         * adding some nodes to the map! */
> > +        *nodes += retry;
> > +        if (retry == NODES_POLICY_RETRY_INC)
> > +            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
>
> Hmm -- if I'm reading this right, the only time the nodemap won't be all
> nodes is if (1) the user specified nodes, or (2) there's a cpumask in
> effect. If we're going to override that setting, wouldn't it make sense
> to just expand to all numa nodes?
>
As you wish. The whole "what to do if what I've been provided with
doesn't work" part is in *wild guess* status, meaning I tried to figure
out what would be best to do, but I might well be far from the actual
correct solution, provided there is one. Trying to enlarge the nodemap
step by step potentially yields better performance, but is probably not
so close to the "least surprise" principle one should follow when
designing UIs. :-(

> Hmm -- though I suppose what you'd really want to try is adding each
> node in turn, rather than one at a time (i.e., if the cpus are pinned to
> nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before
> trying [1,2,3,4]).
>
Yep, that makes a real lot of sense, thanks! I can definitely try doing
that, although it will complicate the code a bit...

> But that's starting to get really complicated -- I
> wonder if it's better to just fail and let the user change the pinnings
> / configuration node mapping.
>
Well, that would probably be the least surprising behaviour. Again, just
let me know what you think is best among the various alternatives and
I'll go for it.

> > +
> > +        if (use_cpus >= b_info->max_vcpus) {
> > +            rc = 0;
> > +            break;
> > +        }
>
> Hmm -- there's got to be a better way to find out the minimum number of
> nodes to house a given number of vcpus than just starting at 1 and
> re-trying until we have enough.
>
> > +        /* Add one more node and retry fitting the domain */
> > +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
>
> Same comment as above.
>
I'm not sure I'm getting this. The whole point here is: let's consider
free memory on the various nodes first, and then adjust the result if
some other constraint is being violated. However, if what you mean is
that I could check beforehand whether or not the user-provided
configuration will give us enough CPUs, and avoid testing scenarios that
are guaranteed to fail, then I agree and I'll reshape the code to look
like that. This triggers the heuristics re-design from above again,
though, as one has to decide what to do if the user asks for
"nodes=[1,3]" and I discover (earlier) that I need one more node to have
enough CPUs (I mean, which node should I try first?).
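BTW, if the number of usable CPUs per node were known up front, the
minimum number of nodes could be computed directly instead of growing
the nodemap one node at a time. Something like the sketch below (which
assumes, purely for illustration, a uniform CPU count per node, which is
not necessarily how the code ends up being structured):

/* Smallest number of nodes whose pcpus can host max_vcpus vcpus,
 * assuming a uniform cpus_per_node count on every node. */
static int min_nodes_for_vcpus(int max_vcpus, int cpus_per_node)
{
    return (max_vcpus + cpus_per_node - 1) / cpus_per_node;
}

The placement loop could then start directly from that many nodes
instead of starting from 1 and retrying.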
> >
> > +    ret = place_domain(&d_config.b_info);
> > +    if (ret == ESRCH || ret == ENOSPC) {
> > +        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
> > +        libxl_nodemap_set_any(&d_config.b_info.nodemap);
>
> Hmm -- is this the right fallback? I think in general, if the user asks
> for X to be done, and X cannot be done for whatever reason, the right
> thing is to tell the user that X cannot be done and let them sort it
> out, rather than resorting to Y (unless that's been set).
>
I agree and I will go for this.

> Is it the case that if any node placement fails, then they'll all fail?
> Or to put it the other way, is it the case that if there is *some*
> placement, that any placement algorithm will eventually find it? If so,
> then I think this fallback may make sense, as there's nothing for the
> user to really do.
>
Well, that's tricky. I mean, if we were only talking about fitting all
of the VM's memory on one single node, then yes: no matter how you order
the list of nodes, if there is at least one with enough space for the
VM, you'll find it. However, we might be dealing with "fit the VM on
multiple nodes" scenarios, which is a completely different thing, and
changing the number of nodes onto which you're trying to fit the VM
changes pretty much everything.

So, I'm not entirely sure I answered your question, but the point is
that your idea above is the best one: if you ask for something and we
don't manage to get it done, we just stop and let you figure things out.
I've only one question about this approach: what if automatic placement
is/becomes the default? I mean, avoiding any kind of fallback (which,
again, makes sense to me in case the user is explicitly asking for
something specific) would mean that a completely NUMA-unaware VM
creation can be aborted even if the user did not ask for anything... How
do we deal with this?

> > diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > ...
>
> This should be in its own patch.
>
Ok.

Thanks a lot again for taking a look!

Regards,
Dario

--
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)