From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Date: Wed, 18 Jul 2012 02:22:27 +0200 Message-ID: <1342570947.11794.92.camel@Abyss> References: <0411b2cebd725b193465.1341932614@Solace> <20485.35590.105351.434937@mariner.uk.xensource.com> <5fa66c8b9093399e5bc3.1342458792@Solace> <20485.43293.833036.352186@mariner.uk.xensource.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============7183252895715778931==" Return-path: In-Reply-To: <20485.43293.833036.352186@mariner.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Jackson Cc: Andre Przywara , Ian Campbell , Stefano Stabellini , George Dunlap , Juergen Gross , xen-devel List-Id: xen-devel@lists.xenproject.org --===============7183252895715778931== Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-4ydyJVh5DgbkEHwxaj/B" --=-4ydyJVh5DgbkEHwxaj/B Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, 2012-07-17 at 19:04 +0100, Ian Jackson wrote:=20 > I've been told that boxes with 16 NUMA nodes are coming out about > now. >=20 > This algorithm will start to have unreasonable performance with > anything bigger than that. Eg choosing 16 NUMA nodes out of 32 would > involve searching 601,080,390 combinations. 32 out of 64 gives 10^18. >=20 Ok, when designing this thing, I knew it would turn out to be the most computationally complex part of the whole NUMA thing (hopefully!). That was on purpose, mainly because this is the *first* (from the guest's lifetime POV) of the various others NUMA-aware related heuristics that will come, and the only one that will run userspace and far from critical paths. That's why, when the suggestion to consider "all the possible combination of ..." came, during one round of the review, I liked it very much and went for it. _That_all_being_said_, the algorithm takes some input parameters, the most important of which is probably max_nodes. It should contain the maximum number of nodes one wants his guest to span. I worked really hard for it not to not be just hardcoded to 1, as that would mean we were giving up trying to improve performances for guests not fitting in just one node. However, it is not written in stone that it must be free to range from 1 to #_of_nodes. That is sure possible, with the current NUMA systems, and here they are the number of combinations we get for 4, 8 and 16 nodes (n), respectively:=20 $ echo "combs=3D0; n=3D4; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor; = combs" | octave -q combs =3D 15 $ echo "combs=3D0; n=3D8; for k=3D1:8 combs=3Dcombs+bincoeff(n,k); endfor;= combs" | octave -q combs =3D 255 $ echo "combs=3D0; n=3D16; for k=3D1:16 combs=3Dcombs+bincoeff(n,k); endfo= r; combs" | octave -q combs =3D 65535 However, it might well be decided that it is sane to limit max_nodes to, say, 4, which would mean we'll be trying our best to boost the performances of all guests that fits within 4 nodes (which is a lot better than just one!), which would result in these numbers, for nodes (n) equal to 4, 8, 16, 32 and 64:=20 $ echo "combs=3D0; n=3D4; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor; = combs" | octave -q combs =3D 15 $ echo "combs=3D0; n=3D8; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor;= combs" | octave -q combs =3D 162 $ echo "combs=3D0; n=3D16; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor= ; combs" | octave -q combs =3D 2516 $ echo "combs=3D0; n=3D32; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor= ; combs" | octave -q combs =3D 41448 $ echo "combs=3D0; n=3D64; for k=3D1:4 combs=3Dcombs+bincoeff(n,k); endfor= ; combs" | octave -q combs =3D 679120 Moreover, as another input parameter is a nodemask, telling the algorithm not only how many but also _which_ nodes it should consider, one could imagine to have some multi-tier form of the algorithm itself, or, perhaps, implementing a more abstract and coarse partitioning heuristics among group of nodes (which will better be done as soon as we'll see how 32 and 64 nodes system will actually look like :-D) and then run the current algorithm only on these groups, with numbers that would then look just like above. So, to summarize, I can't be farther than saying the current algorithm is perfect or fast. Rather, I'm definitely saying it has the advantage of not branching out the problem space too much aggressively at this early stage, and it is flexible enough for avoiding getting unreasonable performances, or at least, it looks like that to me, I guess. :-) Thoughts? Thanks and Regards, Dario --=20 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-4ydyJVh5DgbkEHwxaj/B Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEABECAAYFAlAGAcMACgkQk4XaBE3IOsRTvQCfYVB5OBtN9sI98hn5wPJmyuuP 558AoIHuCriiOjsz/epsdJ8h8Ifrj08L =sPni -----END PGP SIGNATURE----- --=-4ydyJVh5DgbkEHwxaj/B-- --===============7183252895715778931== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============7183252895715778931==--