From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andre Przywara <andre.przywara@amd.com>
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic
 placement of guests on NUMA nodes
Date: Fri, 20 Jul 2012 10:19:08 +0200
Message-ID: <5009147C.9050604@amd.com>
References: <patchbomb.1342458791@Solace>
	<5fa66c8b9093399e5bc3.1342458792@Solace>
	<5007FBCE.6000201@amd.com> <1342707771.19530.235.camel@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1342707771.19530.235.camel@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <raistlin@linux.it>
Cc: Ian Campbell <Ian.Campbell@citrix.com>, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>, George Dunlap <george.dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Juergen Gross <juergen.gross@ts.fujitsu.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 07/19/2012 04:22 PM, Dario Faggioli wrote:
> On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote:
>> Dario,

Thanks for the warm welcome.

 >> ...
>> As you can see, the nodes with more memory are _way_ overloaded, while
>> the lower memory ones are underutilized. In fact the first 20 guests
>> didn't use the other nodes at all.
>> I don't care so much about the two memory-less nodes, but I'd like to
>> know how you came to the magic "3" in the formula:
>>
>>> +
>>> +    return sign(3*freememkb_diff + nrdomains_diff);
>>> +}
>>
> Ok. The idea behind the current implementation of that heuristics is to
> prefer nodes with more free memory. In fact, this leaves larger "holes",
> maximizing the probability of being able to put more domain there. Of
> course that means more domains exploiting local accesses, but introduces
> the risk of overloading large (from a memory POV) nodes with a lot of
> domains (which seems right what's happening to you! :-P).

I always assumed the vast majority of actual users/customers use 
comparatively small domains, something like 1-4 VCPUs and around 4 GB of 
RAM. So these domains are much smaller than a typical node. I'd consider a 
node size of 16 GB the lower bound, with up to 128 GB being the common 
case. Sure, there are bigger and smaller machines, but I'd consider 
this the sweet spot.

> Therefore, I wanted to balance that by putting something related to the
> current load on a node into the equation. Unfortunately, I really am not
> sure yet what a reasonable estimation of the actual "load on a node"
> could be. Even worse, Xen does not report anything even close to that,
> at least not right now. That's why I went for a quite dumb count of the
> number of domains for now, waiting to find the time to implement
> something more clever.

Right. So we just use the number of already pinned vCPUs as the metric. 
Let me see if I can change the code to really use the number of vCPUs 
instead of the number of domains. A domain could be UP or 8-way SMP, which 
really makes a big difference wrt load on a node.
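
As a rough sketch of that change (not the actual libxl code: the struct, 
the normalized_diff() helper and the sign convention are assumptions made 
for illustration), the comparison could keep the quoted formula and just 
swap the domain count for a vCPU count:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical candidate description; NOT the libxl structure. */
struct candidate {
    uint64_t free_memkb; /* free memory on the candidate's nodes, in KiB */
    int nr_vcpus;        /* vCPUs already pinned to the candidate's nodes */
};

/* Relative difference in [-1, 1]; the normalization itself is an assumption. */
static double normalized_diff(double a, double b)
{
    double max = a > b ? a : b;

    assert(a >= 0.0 && b >= 0.0);
    if (max == 0.0)
        return 0.0;
    return (a - b) / max;
}

static int sign(double x)
{
    return (x > 0) - (x < 0);
}

/* Returns <0 if c1 is the better candidate, >0 if c2 is, 0 on a tie. */
static int candidate_cmp(const struct candidate *c1,
                         const struct candidate *c2)
{
    /* Negative when c1 has more free memory than c2. */
    double freememkb_diff = normalized_diff((double)c2->free_memkb,
                                            (double)c1->free_memkb);
    /* Negative when c1 has fewer pinned vCPUs than c2. */
    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);

    return sign(3 * freememkb_diff + nrvcpus_diff);
}
```

Under these assumptions, a node hosting one 8-way guest compares as more 
loaded than a node hosting two UP guests, which a plain domain count would 
get the other way round. Note that with the weight of 3 a node with twice 
the free memory still wins even if it is the more loaded one.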

In the long run we need something like a per-node (or per-pCPU) load 
average. We cannot foresee the future, but we just assume that the past 
is a good indicator for it. xl top generates such numbers on demand 
already. But that surely is something for 4.3, just wanted to mention it.

> So, that is basically why I thought it could be a good idea to
> overweight the differences in free memory wrt the differences in number
> of assigned domain. The first implementation was only considering the
> number of assigned domain to decide which was the best candidate between
> two that were less 10% different in their amount of free memory.
> However, that didn't produce a good comparison function, and thus I
> rewrote it like above, with the magic 3 selected via trial and error to
> mimic something similar to the old 10% rule.

OK, I see. I thought about this a bit more and agree a single heuristic 
formula isn't easy to find. After reading the code I consider this a bit 
over-engineered, but I cannot possibly complain about this after having 
remained silent for such a long time.
So let's see what we can make out of this code. I'm just throwing out some 
ideas; feel free to ignore them if they are dumb ;-)

So if you agree to the small-domain assumption, then domains easily 
fitting into a node are the rule, not the exception. We should handle it 
that way. Maybe we can also solve the complexity problem by only 
generating single node candidates in the first place and only if these 
don't fit look at alternatives?

I really admire that lazy comb_next generation function, so why don't we 
use it in a really lazy way? I think there was already a discussion 
about this, I just don't remember what its outcome was.
Some debugging code showed that on the above (empty) machine a 2 
VCPUs/2GB domain generated already 255 candidates. That really looks 
like overkill, especially if we actually should focus on the 8 
single-node ones.
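
For reference, 255 is what one gets when every non-empty subset of 8 nodes 
counts as a candidate (2^8 - 1). The small sketch below only does that 
arithmetic; the function names are made up for illustration and it says 
nothing about how the patch actually enumerates candidates:

```c
#include <assert.h>

/* Binomial coefficient C(n, k) via the multiplicative formula. */
static unsigned long binomial(unsigned n, unsigned k)
{
    unsigned long result = 1;
    unsigned i;

    assert(k <= n);
    for (i = 1; i <= k; i++)
        result = result * (n - k + i) / i; /* exact at every step */
    return result;
}

/* Number of node subsets of size 1..max_nodes out of nr_nodes nodes. */
static unsigned long count_candidates(unsigned nr_nodes, unsigned max_nodes)
{
    unsigned long total = 0;
    unsigned k;

    for (k = 1; k <= max_nodes && k <= nr_nodes; k++)
        total += binomial(nr_nodes, k);
    return total;
}
```

Here count_candidates(8, 8) gives 255, while stopping at single-node 
candidates, count_candidates(8, 1), gives just 8.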

Maybe we can use a two-step approach? First use a simple heuristic 
similar to the xend one:
We only consider nodes with enough free memory. Then we look for the 
least utilized ones: Simply calculate the difference between the number 
of pCPUs and the number of currently pinned vCPUs. So any node with free 
(aka non-overcommitted) CPUs should really be considered first.
After all we don't need to care about memory latency if the domains 
starve for compute time and only get a fraction of each pCPU.
If you don't want to believe this, I can run some benchmarks to prove this.

If we somehow determine that this approach doesn't work (no nodes with 
enough free memory or more vCPUs than CPUs-per-node) we should use the 
sophisticated algorithm.
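
A minimal sketch of that first step, assuming per-node free memory, pCPU 
count and pinned-vCPU count are available (struct node_info and 
pick_single_node() are hypothetical names, not existing libxl 
interfaces):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node summary; NOT a libxl structure. */
struct node_info {
    uint64_t free_memkb;  /* free memory on the node, in KiB */
    int nr_pcpus;         /* physical CPUs on the node */
    int nr_pinned_vcpus;  /* vCPUs already pinned to the node */
};

/*
 * Step one: pick the single node that fits the domain and has the most
 * spare pCPUs (pCPUs minus pinned vCPUs). Returns the node index, or -1
 * if no single node fits, in which case the caller would fall back to
 * the full candidate-generation algorithm.
 */
static int pick_single_node(const struct node_info *nodes, size_t nr_nodes,
                            uint64_t dom_memkb, int dom_vcpus)
{
    int best = -1;
    int best_spare = 0;
    size_t i;

    assert(nodes != NULL || nr_nodes == 0);

    for (i = 0; i < nr_nodes; i++) {
        int spare;

        if (nodes[i].free_memkb < dom_memkb)
            continue; /* not enough free memory on this node */
        if (dom_vcpus > nodes[i].nr_pcpus)
            continue; /* more vCPUs than CPUs-per-node */

        spare = nodes[i].nr_pcpus - nodes[i].nr_pinned_vcpus;
        if (best < 0 || spare > best_spare) {
            best = (int)i;
            best_spare = spare;
        }
    }
    return best;
}
```

Ties go to the lowest-numbered node here, which is an arbitrary choice; 
free memory could just as well serve as the tie-breaker.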

Also consider the following: With really big machines or with odd 
configurations people will probably do their pinning/placement 
themselves (or by external mgmt applications).
What this automatic placement algorithm is good for is more the 
What-is-this-NUMA-thingie-anyways people.

>
> That all being said, this is the first time the patchset had the chance
> to run on such a big system, so I'm definitely open to suggestion on how
> to make that formula better in reflecting what we think it's The Right
> Thing!
>
>> I haven't done any measurements on this, but I guess scheduling 36 vCPUs
>> on 8 pCPUs has a much bigger performance penalty than any remote NUMA
>> access, which nowadays is much better than a few years ago, with big L3
>> caches, better predictors and faster interconnects.
>>
> Definitely. Consider that the guests are being pinned because that is
> what XenD does and because there has been no time to properly refine the
> implementation of a more NUMA-aware scheduler for 4.2. In future, as
> soon as I'll have it ready, the vcpus from the overloaded nodes would
> get some runtime on the otherwise idle ones, even if they're remote.

Right, that sounds good. If you have any good (read: meaningful to 
customers) benchmarks I can do some experiments on my machine to 
fine-tune this.

> Nevertheless, this is what Xen 4.2 will have, and I really think initial
> placement is a very important step, and we must get the most out of
> being able to do it well (as opposed to other technologies, where
> something like that has to happen in the kernel/hypervisor, which
> entails a lot of limitations we don't have!), and am therefore happy
> about trying to do so as hard as I can.

Right. We definitely need some placement for 4.2. Let's push this in if 
at all possible.

>> I will now put in the memory again and also try to play a bit with the
>> amount of memory and VCPUs per guest.
>>
> Great, keep me super-posted about these things and feel free to ask
> anything that comes to your mind! :-)

So changing the VCPUs and memory config didn't make any real difference. 
I think that is because the number of domains is considered, not the 
number of vCPUs. This should be fixed.

Second I inserted the memory again. I only have 24 DIMMs for 32 sockets 
(we have easy access to boards and CPUs, but memory we have to buy like 
everyone else ;-), so I have to go with this alternating setup, having 
four 16GB nodes and four 8 GB nodes.

This didn't change much, so 16 guests ended up with:
4-1-0-0-4-0-4-3 setup (domains per node). The 0's or 1's were the 8GB 
nodes. The guests were 2P/2GB ones.

So far.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12