From: Andre Przywara
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
Date: Fri, 20 Jul 2012 10:19:08 +0200
Message-ID: <5009147C.9050604@amd.com>
References: <5fa66c8b9093399e5bc3.1342458792@Solace> <5007FBCE.6000201@amd.com> <1342707771.19530.235.camel@Solace>
In-Reply-To: <1342707771.19530.235.camel@Solace>
To: Dario Faggioli
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 07/19/2012 04:22 PM, Dario Faggioli wrote:
> On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote:
>> Dario, thanks for the warm welcome.
>> ...
>> As you can see, the nodes with more memory are _way_ overloaded, while
>> the lower memory ones are underutilized. In fact the first 20 guests
>> didn't use the other nodes at all.
>> I don't care so much about the two memory-less nodes, but I'd like to
>> know how you came to the magic "3" in the formula:
>>
>>> +
>>> +    return sign(3*freememkb_diff + nrdomains_diff);
>>> +}
>>
> Ok. The idea behind the current implementation of that heuristics is to
> prefer nodes with more free memory. In fact, this leaves larger "holes",
> maximizing the probability of being able to put more domain there. Of
> course that means more domains exploiting local accesses, but introduces
> the risk of overloading large (from a memory POV) nodes with a lot of
> domains (which seems right what's happening to you! :-P).

I always assumed the vast majority of actual users/customers use
comparably small domains, something like 1-4 VCPUs and like 4 GB of RAM.
So these domains are much smaller than a usual node.
I'd consider a node size of 16 GB the lower boundary, with up to 128 GB
as the common scenarios. Sure there are bigger or smaller machines, but
I'd consider this the sweet spot.

> Therefore, I wanted to balance that by putting something related to the
> current load on a node into the equation. Unfortunately, I really am not
> sure yet what a reasonable estimation of the actual "load on a node"
> could be. Even worse, Xen does not report anything even close to that,
> at least not right now. That's why I went for a quite dumb count of the
> number of domains for now, waiting to find the time to implement
> something more clever.

Right. So we just use the number of already pinned vCPUs as the metric.
Let me see whether I can change the code to really use the number of
vCPUs instead of the number of domains. A domain could be UP or 8-way
SMP, which makes a big difference wrt the load on a node.
In the long run we need something like a per-node (or per-pCPU) load
average. We cannot foresee the future, but we just assume that the past
is a good indicator for it. xl top generates such numbers on demand
already. But that surely is something for 4.3, just wanted to mention it.

> So, that is basically why I thought it could be a good idea to
> overweight the differences in free memory wrt the differences in number
> of assigned domain. The first implementation was only considering the
> number of assigned domain to decide which was the best candidate between
> two that were less 10% different in their amount of free memory.
> However, that didn't produce a good comparison function, and thus I
> rewrote it like above, with the magic 3 selected via trial and error to
> mimic something similar to the old 10% rule.

OK, I see. I thought about this a bit more and agree that a single
heuristic formula isn't easy to find.
After reading the code I consider this a bit over-engineered, but I
cannot possibly complain about this after having remained silent for
such a long time.
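
To make the effect of the magic 3 a bit more tangible, here is a rough
sketch of such a comparison function, already using pinned vCPUs instead
of domains. Only the sign(3*... + ...) line is taken from the patch. The
struct, the field names and the normalized_diff() helper are assumptions
of mine for illustration (I assume both differences are relative ones in
the range [-1, 1]), so take it as a toy model, not as the libxl code:

```c
#include <assert.h>

/* Hypothetical candidate summary; names are made up for this sketch. */
struct cand {
    double free_memkb;  /* free memory on the candidate, in KB */
    int nr_vcpus;       /* vCPUs already pinned there */
};

/* ASSUMPTION: relative difference in [-1, 1], 0 if both inputs are 0. */
static double normalized_diff(double a, double b)
{
    double m = a > b ? a : b;

    assert(a >= 0 && b >= 0);
    return m == 0 ? 0.0 : (a - b) / m;
}

static int sign(double v)
{
    return (v > 0) - (v < 0);
}

/* Negative result: c1 is the better candidate (qsort() convention). */
static int cand_cmp(const struct cand *c1, const struct cand *c2)
{
    /* Negative when c1 has more free memory than c2. */
    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
    /* Negative when c1 has fewer vCPUs pinned than c2. */
    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);

    return sign(3 * freememkb_diff + nrvcpus_diff);
}
```

Under these assumptions a candidate with 16 GB free and 36 vCPUs pinned
still beats an idle one with 8 GB free (3 * -0.5 + 1 = -0.5), while an
idle candidate with 14 GB free would win (3 * -0.125 + 1 = 0.625). So
the load term only matters when the free memory is fairly close.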
So let's see what we can make out of this code. I am just firing off
some ideas, feel free to ignore them in case they are dumb ;-)

So if you agree to the small-domain assumption, then domains easily
fitting into a node are the rule, not the exception. We should handle it
that way. Maybe we can also solve the complexity problem by only
generating single-node candidates in the first place, and only look at
alternatives if these don't fit? I really admire that lazy comb_next
generation function, so why don't we use it in a really lazy way?
I think there was already a discussion about this, I just don't remember
what its outcome was.
Some debugging code showed that on the above (empty) machine a
2 VCPUs/2GB domain already generated 255 candidates. That really looks
like overkill, especially if we actually should focus on the 8
single-node ones.

Maybe we can use a two-step approach? First use a simple heuristic
similar to the xend one: we only consider nodes with enough free memory.
Then we look for the least utilized ones: simply calculate the
difference between the number of currently pinned vCPUs and the number
of pCPUs. So any node with free (aka non-overcommitted) CPUs should
really be considered first. After all we don't need to care about memory
latency if the domains starve for compute time and only get a fraction
of each pCPU. If you don't want to believe this, I can run some
benchmarks to prove it.
If we somehow determine that this approach doesn't work (no nodes with
enough free memory, or more vCPUs than CPUs per node), we should use the
sophisticated algorithm.

Also consider the following: with really big machines or with odd
configurations people will probably do their pinning/placement
themselves (or via external mgmt applications). What this automatic
placement algorithm is good for is more the
What-is-this-NUMA-thingie-anyways people.
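
Just to illustrate the first step of that two-step idea, a minimal
sketch could look like the code below. Everything in it (struct
node_info, its fields, pick_single_node()) is hypothetical and only
meant to show the idea, none of it is existing libxl code. A return
value of -1 means "no single node fits", in which case the caller would
fall back to the full candidate search:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node summary; names are illustrative, not libxl's. */
struct node_info {
    unsigned long free_memkb;  /* free memory on the node, in KB */
    int nr_pcpus;              /* physical CPUs on the node */
    int pinned_vcpus;          /* vCPUs currently pinned to the node */
};

/*
 * Step one: try to place the domain on a single node.
 * Returns the index of the fitting node with the most spare pCPUs
 * (first one wins on a tie), or -1 if no single node fits.
 */
static int pick_single_node(const struct node_info *nodes, size_t nr_nodes,
                            unsigned long dom_memkb, int dom_vcpus)
{
    int best = -1;
    int best_spare = 0;
    size_t i;

    assert(nodes != NULL || nr_nodes == 0);

    for (i = 0; i < nr_nodes; i++) {
        /* Negative spare means the node is already overcommitted. */
        int spare = nodes[i].nr_pcpus - nodes[i].pinned_vcpus;

        if (nodes[i].free_memkb < dom_memkb)
            continue;   /* not enough free memory */
        if (dom_vcpus > nodes[i].nr_pcpus)
            continue;   /* more vCPUs than CPUs per node */

        if (best < 0 || spare > best_spare) {
            best = (int)i;
            best_spare = spare;
        }
    }
    return best;
}
```

Note that this deliberately ignores how much free memory is left beyond
what the domain needs. Memory only acts as a filter here, the ranking is
done purely on CPU utilization.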
>
> That all being said, this is the first time the patchset had the chance
> to run on such a big system, so I'm definitely open to suggestion on how
> to make that formula better in reflecting what we think it's The Right
> Thing!
>
>> I haven't done any measurements on this, but I guess scheduling 36 vCPUs
>> on 8 pCPUs has a much bigger performance penalty than any remote NUMA
>> access, which nowadays is much better than a few years ago, with big L3
>> caches, better predictors and faster interconnects.
>>
> Definitely. Consider that the guests are being pinned because that is
> what XenD does and because there has been no time to properly refine the
> implementation of a more NUMA-aware scheduler for 4.2. In future, as
> soon as I'll have it ready, the vcpus from the overloaded nodes would
> get some runtime on the otherwise idle ones, even if they're remote.

Right, that sounds good. If you have any good (read: meaningful to
customers) benchmarks, I can do some experiments on my machine to
fine-tune this.

> Nevertheless, this is what Xen 4.2 will have, and I really think initial
> placement is a very important step, and we must get the most out of
> being able to do it well (as opposed to other technologies, where
> something like that has to happen in the kernel/hypervisor, which
> entails a lot of limitations we don't have!), and am therefore happy
> about trying to do so as hard as I can.

Right. We definitely need some placement for 4.2. Let's push this in if
at all possible.

>> I will now put in the memory again and also try to play a bit with the
>> amount of memory and VCPUs per guest.
>>
> Great, keep me super-posted about these things and feel free to ask
> anything that comes to your mind! :-)

So changing the VCPUs and memory config didn't make any real difference.
I think that is because the number of domains is considered, not the
number of vCPUs. This should be fixed.
Second, I inserted the memory again.
I only have 24 DIMMs for 32 sockets (we have easy access to boards and
CPUs, but memory we have to buy like everyone else ;-), so I have to go
with this alternating setup, having four 16 GB nodes and four 8 GB nodes.
This didn't change much: 16 guests ended up in a 4-1-0-0-4-0-4-3 setup
(domains per node). The 0's or 1's were the 8 GB nodes. The guests were
2P/2GB ones.

So far.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12