From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dario Faggioli <raistlin@linux.it>
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit
 memory-wise placement of guests on nodes
Date: Thu, 03 May 2012 16:58:51 +0200
Message-ID: <1336057131.2313.82.camel@Abyss>
References: <patchbomb.1334150267@Solace>
	<31163f014d8a2da9375f.1334150275@Solace>
	<4FA0051F.6020909@eu.citrix.com>
	<1335976251.2961.123.camel@Abyss> <4FA28B1B.8040205@eu.citrix.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============3055614114624587010=="
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <4FA28B1B.8040205@eu.citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>, Ian Campbell <Ian.Campbell@citrix.com>, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>, Juergen Gross <juergen.gross@ts.fujitsu.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Jan Beulich <JBeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org


--===============3055614114624587010==
Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature";
	boundary="=-6C60f7a+kgUahT5xki3b"


--=-6C60f7a+kgUahT5xki3b
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, 2012-05-03 at 14:41 +0100, George Dunlap wrote:
> >> Why are "nodes_policy" and "num_nodes_policy" not passed in along with
> >> b_info?
> > That was my first implementation. Then I figured out that I want to do
> > the placement in _xl_, not in _libxl_, so I really don't need to muck up
> > build info with placement related stuff. Should I use b_info anyway,
> > even if I don't need these fields while in libxl?
> Ah right -- yeah, probably since b_info is a libxl structure, you
> shouldn't add it in there.  But in that case you should probably add
> another xl-specific structure and pass it through, rather than having
> global variables, I think.  It's only used in the handful of placement
> functions, right?
>
It is and yes, I can and will try to "pack" it better. Thanks.

> > - which percent should I try, and in what order? I mean, 75%/25%
> >     sounds reasonable, but maybe also 80%/20% or even 60%/40% helps your
> >     point.
> I had in mind no constraints at all on the ratios -- basically, if you
> can find N nodes such that the sum of free memory is enough to create
> the VM, even 99%/1%, then go for that rather than looking for N+1.
> Obviously finding a more balanced option would be better.
>
I see, and this looks both reasonable and doable with reasonable
effort. :-)

> One option
> would be to scan through finding all sets of N nodes that will satisfy
> the criteria, and then choose the most "balanced" one.  That might be
> more than we need for 4.2, so another option would be to look for evenly
> balanced nodes first, then if we don't find a set, look for any set.
> (That certainly fits with the "first fit" description!)
>
Yep, that would need some extra effort that we might want to postpone. I
like what you proposed above and I'll look into putting it together
without turning xl too much into a set-theory processing machine. ;-D
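
Just to check that we mean the same thing, here is a rough sketch of the
"balanced first, then any set" idea. It is Python only for brevity, it is
not what the patch does, and every name in it (place_first_fit, free_mem)
is made up. I'm also assuming that "balanced" means every node in the set
can host an equal share of the guest's memory:

```python
from itertools import combinations


def place_first_fit(free_mem, needed):
    """Pick a set of nodes for a guest needing `needed` MiB of memory.

    free_mem: dict mapping node id -> free memory in MiB (hypothetical input).
    Returns (nodes, balanced), or (None, False) if nothing fits.
    Sets are tried by increasing size, so N nodes always win over N+1.
    """
    nodes = sorted(free_mem)
    for n in range(1, len(nodes) + 1):
        candidates = list(combinations(nodes, n))
        share = needed / n
        # Pass 1: a set where each node can take an equal share.
        for cand in candidates:
            if all(free_mem[node] >= share for node in cand):
                return cand, True
        # Pass 2: any set whose total free memory is enough (even 99%/1%).
        for cand in candidates:
            if sum(free_mem[node] for node in cand) >= needed:
                return cand, False
    return None, False
```

The exhaustive walk over the combinations is only fine because the number
of nodes is small; picking the *most* balanced set among all the fitting
ones is the part I'd postpone.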

> > - suppose I go for 75%/25%, what about the scheduling of the VM?
> Haha -- yeah, for a research paper, you'd probably implement some kind
> of lottery scheduling algorithm that would schedule it on one node 75%
> of the time and another node 25% of the time. :-)
>
That would be awesome. I know a bunch of people who would manage to
get at least 3 or 4 publications out of this idea (one about the
idea, one about the implementation, one about the statistical
guarantees, etc.). :-P :-P

> But I think that just
> making the node affinity equal to both of them will be good enough for
> now.  There will be some variability in performance, but there will be
> some of that anyway depending on what node's memory the guest happens to
> use more.
>
Agreed. I'm really thinking about taking your suggestion of keeping
the layout "fluid" and then setting the affinity to all of the involved
nodes. Thanks again.

> This is actually kind of a different issue, but I'll bring it up now
> because it's related.  (Something to toss in for thinking about in 4.3
> really.)  Suppose there are 4 cores and 16GiB per node, and a VM has 8
> vcpus and 8GiB of RAM.  The algorithm you have here will attempt to put
> 4GiB on each of two nodes (since it will take 2 nodes to get 8 cores).
> However, it's pretty common for larger VMs to have way more vcpus than
> they actually use at any one time.  So it might actually have better
> performance to put all 8GiB on one node, and set the node affinity
> accordingly.  In the rare event that more than 4 vcpus are active, a
> handful of vcpus will have all remote accesses, but the majority of the
> time, all of the cpus will have local accesses.  (OTOH, maybe that
> should be only a policy thing that we should recommend in the
> documentation...)
>
That makes a lot of sense, and I was also thinking along the very same
lines. Once we have a proper mechanism to account for CPU load, we
could add another policy for this, i.e., once we've got all the
information about the memory, what should we do with respect to (v)cpus?
Should we spread the load or prefer some more "packed" placement? Again,
I think we can both add more tunables and also make things dynamic
(e.g., have all the 8 vcpus on the 4 cpus of a node and, if they're
generating too many remote accesses, expand the node affinity and move
some of the memory) once we have more of the mechanism in place.

> Dude, this is open source.  Be opinionated. ;-)
>
Thanks for the advice... From now on, I'll do my best! :-)

> What do you think of my suggestions above?
>
As I said, I actually like it.

> I think if the user specifies a nodemap, and that nodemap doesn't have
> enough memory, we should throw an error.
>
Agreed.
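
In other words, something along these lines (again just an illustrative
Python sketch with invented names, not code from the series):

```python
def check_user_nodemap(nodemap, free_mem, needed):
    """Fail early if the nodes the user asked for cannot host the guest.

    nodemap:  iterable of node ids explicitly requested by the user
    free_mem: dict mapping node id -> free memory in MiB (hypothetical input)
    needed:   guest memory in MiB
    """
    available = sum(free_mem[node] for node in nodemap)
    if available < needed:
        # The user gave explicit nodes: report it instead of silently
        # falling back to other nodes.
        raise ValueError(
            "requested nodes have %d MiB free, guest needs %d MiB"
            % (available, needed)
        )
    return available
```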

> If there's a node_affinity set, no memory on that node, but memory on a
> *different* node, what will Xen do?  It will allocate memory on some
> other node, right?
>
Yep.

> So ATM even if you specify a cpumask, you'll get memory on the masked
> nodes first, and then memory elsewhere (probably in a fairly random
> manner); but as much of the memory as possible will be on the masked
> nodes.  I wonder then if we shouldn't just keep that behavior -- i.e.,
> if there's a cpumask specified, just return the nodemask from that mask,
> and let Xen put as much as possible on that node and let the rest fall
> where it may.
>
Well, that sounds reasonable after all, and I'm pretty sure it would
simplify the code a bit, which is not a bad thing at all. So yes, I'm
inclined to agree on this.
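
Deriving the nodemask from the cpumask would then be little more than the
following (illustrative Python; cpu_to_node is a made-up stand-in for
whatever cpu-to-node mapping the topology information gives us):

```python
def nodemask_from_cpumask(cpumask, cpu_to_node):
    """Return the sorted list of nodes that the given pcpus belong to.

    cpumask:     iterable of pcpu ids the user pinned the guest to
    cpu_to_node: sequence where cpu_to_node[cpu] is the node of that pcpu
                 (hypothetical representation of the host topology)
    """
    return sorted({cpu_to_node[cpu] for cpu in cpumask})
```

With two nodes of 4 pcpus each, a cpumask of pcpus 2, 3 and 4 gives nodes
0 and 1, and Xen is then left to put as much memory as it can on those.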

> > However, if what you mean is I could check beforehand whether or not the
> > user provided configuration will give us enough CPUs and avoid testing
> > scenarios that are guaranteed to fail, then I agree and I'll reshape the
> > code to look like that. This triggers the heuristics re-designing stuff
> > from above again, as one has to decide what to do if user asks for
> > "nodes=[1,3]" and I discover (earlier) that I need one more node for
> > having enough CPUs (I mean, what node should I try first?).
> No, that's not exactly what I meant.  Suppose there are 4 cores per
> node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other
> parameters.  If I'm reading your code right, what it will do is first
> try to find a set of 1 node that will satisfy the constraints, then 2
> nodes, then 3 nodes, then 4, &c.  Since there are at most 4 cores per
> node, we know that 1, 2, and 3 nodes are going to fail, regardless of
> how much memory there is or how many cpus are offline.  So why not just
> start with 4, if the user hasn't specified anything?  Then if 4 doesn't
> work (either because there's not enough memory, or some of the cpus are
> offline), then we can start bumping it up to 5, 6, &c.
>
Fine, I got it now, thanks for clarifying (both here and in the other
e-mail). I'll look into that and, if it doesn't cost too much
thinking/coding/debugging time, I'll go for it.
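
For the record, the starting point you describe could be computed roughly
like this (a Python sketch only, with hypothetical names; cpus_per_node
would be the number of pcpus of each node):

```python
def min_nodes_for_vcpus(vcpus, cpus_per_node):
    """Smallest number of nodes that could possibly provide `vcpus` pcpus.

    cpus_per_node: list with the pcpu count of each node (hypothetical input).
    Returns None if even all the nodes together are not enough.
    """
    total = 0
    # Take the biggest nodes first: if they are not enough, no other
    # choice of the same number of nodes can be.
    for count, cpus in enumerate(sorted(cpus_per_node, reverse=True), start=1):
        total += cpus
        if total >= vcpus:
            return count
    return None
```

With 4 cores per node and 16 vcpus this gives 4, so the search would skip
the attempts with 1, 2 and 3 nodes and only bump the number up from there
if memory or offline cpus get in the way.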

> That's what I was getting at -- but again, if it makes it too
> complicated, trading a bit of extra passes for a significant chunk of
> your debugging time is OK. :-)
>
I'll certainly do so. I really want this thing out ASAP.

> Well, if the user didn't specify anything, then we can't contradict
> anything he specified, right? :-)  If the user doesn't specify anything,
> and the default is "numa=auto", then I think we're free to do whatever
> we think is best regarding NUMA placement; in fact, I think we should
> try to avoid failing VM creation if it's at all possible.  I just meant
> what I think we should do if the user asked for specific NUMA nodes or a
> specific number of nodes.
>
I agree, and it should be easy enough to discern in which of these two
situations we are and act accordingly.

> It seems like we have a number of issues here that would be good for
> more people to come in on --
>
This is something I really would enjoy a lot!

> what if I attempt to summarize the
> high-level decisions we're talking about so that it's easier for more
> people to comment on them?
>
That would be very helpful, indeed.

Thanks and Regards,
Dario

--=20
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


--=-6C60f7a+kgUahT5xki3b
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEABECAAYFAk+inSsACgkQk4XaBE3IOsSuFgCfaAf0x++ZGvjHfMIDIRJto3J5
AyIAn0La1PbL7QsWSLOZSGK+8xBc8hXI
=DeiO
-----END PGP SIGNATURE-----

--=-6C60f7a+kgUahT5xki3b--


--===============3055614114624587010==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


--===============3055614114624587010==--