From: Dario Faggioli <raistlin@linux.it>
To: Andre Przywara <andre.przywara@amd.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
Date: Thu, 19 Jul 2012 16:22:51 +0200	[thread overview]
Message-ID: <1342707771.19530.235.camel@Solace> (raw)
In-Reply-To: <5007FBCE.6000201@amd.com>

On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote:
> Dario,
> 
Hi Andre!

> sorry for joining the discussion so late, but I was busy with other 
> things and saw the project in good hands.
>
Well, thanks. It's turning out to be a big deal (and we're just at the
beginning!), but I'm enjoying working on it, so I won't give up easily! :-P

> Finally I managed to get some testing on these patches.
> 
That's very cool, thanks.

> I took my 8-node machine, alternately equipped with 16GB and 8GB per 
> node. Each node has 8 pCPUs.
> As a special(i)ty I removed the DIMMs from node 2 and 3 to test Andrew's 
> memory-less node patches, leading to this configuration:
> node:    memsize    memfree    distances
>     0:     17280       4518      10,16,16,22,16,22,16,22
>     1:      8192       3639      16,10,16,22,22,16,22,16
>     2:         0          0      16,16,10,16,16,16,16,22
>     3:         0          0      22,22,16,10,16,16,22,16
>     4:     16384       4766      16,22,16,16,10,16,16,16
>     5:      8192       2882      22,16,16,16,16,10,22,22
>     6:     16384       4866      16,22,16,22,16,22,10,16
>     7:      8176       2799      22,16,22,16,16,22,16,10
> 
Interesting. That's really the kind of testing we need in order to
fine-tune the details. Thanks for doing this.

> Then I started 32 guests, each 4 vCPUs and 1 GB of RAM.
> Now since the code prefers free memory so much over free CPUs, the 
> placement was the following:
> node0: guests 2,5,8,11,14,17,20,25,30
> node1: guests 21,27
> node2: none
> node3: none
> node4: guests 1,4,7,10,13,16,19,23,29
> node5: guests 24,31
> node6: guests 3,6,9,12,15,18,22,28
> node7: guests 26,32
> 
> As you can see, the nodes with more memory are _way_ overloaded, while 
> the lower memory ones are underutilized. In fact the first 20 guests 
> didn't use the other nodes at all.
> I don't care so much about the two memory-less nodes, but I'd like to 
> know how you came to the magic "3" in the formula:
> 
> > +
> > +    return sign(3*freememkb_diff + nrdomains_diff);
> > +}
> 
Ok. The idea behind the current implementation of that heuristic is to
prefer nodes with more free memory. This leaves larger "holes",
maximizing the probability of being able to fit more domains there. Of
course that means more domains exploiting local accesses, but it also
introduces the risk of overloading large (from a memory POV) nodes with
a lot of domains (which seems to be exactly what's happening to you! :-P).

Therefore, I wanted to balance that by putting something related to the
current load on a node into the equation. Unfortunately, I'm not yet
sure what a reasonable estimate of the actual "load on a node" would
be. Even worse, Xen does not report anything even close to that, at
least not right now. That's why, for now, I went with a rather dumb
count of the number of domains, until I find the time to implement
something more clever.

So, that is basically why I thought it would be a good idea to weight
the difference in free memory more heavily than the difference in the
number of assigned domains. The first implementation only considered
the number of assigned domains when deciding between two candidates
whose amounts of free memory were within 10% of each other. However,
that didn't produce a good comparison function, so I rewrote it as
above, with the magic 3 selected via trial and error to mimic something
similar to the old 10% rule.

That all being said, this is the first time the patchset has had the
chance to run on such a big system, so I'm definitely open to
suggestions on how to make that formula better reflect what we think is
The Right Thing!

> I haven't done any measurements on this, but I guess scheduling 36 vCPUs 
> on 8 pCPUs has a much bigger performance penalty than any remote NUMA 
> access, which nowadays is much better than a few years ago, with big L3 
> caches, better predictors and faster interconnects.
> 
Definitely. Consider that the guests are being pinned because that is
what XenD does and because there has been no time to properly refine the
implementation of a more NUMA-aware scheduler for 4.2. In the future, as
soon as I have it ready, the vCPUs from the overloaded nodes will get
some runtime on the otherwise idle ones, even if they're remote.

Nevertheless, this is what Xen 4.2 will have, and I really think initial
placement is a very important step. We should get the most out of being
able to do it well in the toolstack (as opposed to other technologies,
where something like that has to happen in the kernel/hypervisor, which
entails a lot of limitations we don't have!), and I'm therefore happy to
try as hard as I can to do so.

> I will now put in the memory again and also try to play a bit with the 
> amount of memory and VCPUs per guest.
> 
Great, keep me super-posted about these things and feel free to ask
anything that comes to mind! :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

