Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]

From: Dario Faggioli <raistlin@linux.it>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]
Date: Wed, 18 Jul 2012 02:22:27 +0200	[thread overview]
Message-ID: <1342570947.11794.92.camel@Abyss> (raw)
In-Reply-To: <20485.43293.833036.352186@mariner.uk.xensource.com>

[-- Attachment #1.1: Type: text/plain, Size: 3856 bytes --]

On Tue, 2012-07-17 at 19:04 +0100, Ian Jackson wrote: 
> I've been told that boxes with 16 NUMA nodes are coming out about
> now.
> 
> This algorithm will start to have unreasonable performance with
> anything bigger than that.  Eg choosing 16 NUMA nodes out of 32 would
> involve searching 601,080,390 combinations.  32 out of 64 gives 10^18.
> 
Ok, when designing this thing, I knew it would turn out to be the most
computationally complex part of the whole NUMA thing (hopefully!).

That was on purpose, mainly because this is the *first* (from the
guest's lifetime POV) of the various others NUMA-aware related
heuristics that will come, and the only one that will run userspace and
far from critical paths. That's why, when the suggestion to consider
"all the possible combination of ..." came, during one round of the
review, I liked it very much and went for it.

_That_all_being_said_, the algorithm takes some input parameters, the
most important of which is probably max_nodes. It should contain the
maximum number of nodes one wants his guest to span. I worked really
hard for it not to not be just hardcoded to 1, as that would mean we
were giving up trying to improve performances for guests not fitting in
just one node. However, it is not written in stone that it must be free
to range from 1 to #_of_nodes. That is sure possible, with the current
NUMA systems, and here they are the number of combinations we get for 4,
8 and 16 nodes (n), respectively: 
$ echo "combs=0; n=4; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  15
 $ echo "combs=0; n=8; for k=1:8 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  255
 $ echo "combs=0; n=16; for k=1:16 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  65535

However, it might well be decided that it is sane to limit max_nodes to,
say, 4, which would mean we'll be trying our best to boost the
performances of all guests that fits within 4 nodes (which is a lot
better than just one!), which would result in these numbers, for nodes
(n) equal to 4, 8, 16, 32 and 64: 
$ echo "combs=0; n=4; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  15
 $ echo "combs=0; n=8; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  162
 $ echo "combs=0; n=16; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  2516
 $ echo "combs=0; n=32; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  41448
 $ echo "combs=0; n=64; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
  combs =  679120

Moreover, as another input parameter is a nodemask, telling the
algorithm not only how many but also _which_ nodes it should consider,
one could imagine to have some multi-tier form of the algorithm itself,
or, perhaps, implementing a more abstract and coarse partitioning
heuristics among group of nodes (which will better be done as soon as
we'll see how 32 and 64 nodes system will actually look like :-D) and
then run the current algorithm only on these groups, with numbers that
would then look just like above.

So, to summarize, I can't be farther than saying the current algorithm
is perfect or fast. Rather, I'm definitely saying it has the advantage
of not branching out the problem space too much aggressively at this
early stage, and it is flexible enough for avoiding getting unreasonable
performances, or at least, it looks like that to me, I guess. :-)

Thoughts?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel