From: Juergen Gross
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
Date: Fri, 20 Jul 2012 10:38:00 +0200
Message-ID: <500918E8.3000708@ts.fujitsu.com>
In-Reply-To: <5009161C.2060005@amd.com>
References: <5fa66c8b9093399e5bc3.1342458792@Solace> <5007FBCE.6000201@amd.com> <1342707771.19530.235.camel@Solace> <1342772429.19530.247.camel@Solace> <5009161C.2060005@amd.com>
To: Andre Przywara
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel, Dario Faggioli
List-Id: xen-devel@lists.xenproject.org

On 20.07.2012 10:26, Andre Przywara wrote:
> On 07/20/2012 10:20 AM, Dario Faggioli wrote:
>> On Thu, 2012-07-19 at 16:22 +0200, Dario Faggioli wrote:
>>> Interesting. That's really the kind of testing we need in order to
>>> fine-tune the details. Thanks for doing this.
>>>
>>>> Then I started 32 guests, each with 4 vCPUs and 1 GB of RAM.
>>>> Now since the code prefers free memory so much over free CPUs, the
>>>> placement was the following:
>>>> node0: guests 2,5,8,11,14,17,20,25,30
>>>> node1: guests 21,27
>>>> node2: none
>>>> node3: none
>>>> node4: guests 1,4,7,10,13,16,19,23,29
>>>> node5: guests 24,31
>>>> node6: guests 3,6,9,12,15,18,22,28
>>>> node7: guests 26,32
>>>>
>>>> As you can see, the nodes with more memory are _way_ overloaded, while
>>>> the lower-memory ones are underutilized. In fact the first 20 guests
>>>> didn't use the other nodes at all.
>>>> I don't care so much about the two memory-less nodes, but I'd like to
>>>> know how you came to the magic "3" in the formula:
>>>>
>>>>> +
>>>>> +    return sign(3*freememkb_diff + nrdomains_diff);
>>>>> +}
>>>>
>>>
>>> That all being said, this is the first time the patchset has had the
>>> chance to run on such a big system, so I'm definitely open to
>>> suggestions on how to make that formula better at reflecting what we
>>> think is The Right Thing!
>>>
>> Thinking more about this, I realize that I was implicitly assuming some
>> symmetry in the amount of memory each node comes with, which is
>> probably something I shouldn't have done...
>>
>> I really am not sure what to do here. Perhaps treat the two metrics
>> more evenly? Or maybe even reverse the logic and give nr_domains more
>> weight?
>
> I replaced the 3 with 1 already; that didn't change much. I think we
> should kind of reverse the importance of node load, since starving for
> CPU time is much worse than bad memory latency. I will do some
> experiments...
>
>> I was also thinking whether it could be worthwhile to consider the total
>> number of vcpus on a node instead of the number of domains, but again,
>> that's not guaranteed to be any more meaningful (suppose there are a lot
>> of idle vcpus)...
>
> Right, that was my thinking on the ride to work as well ;-)
> What about this: 1P and 2P guests really use their vCPUs, but for bigger
> guests we assume only fractional usage?

Hmm. I wouldn't be sure about this. I would guess there is a reason a
guest has more than 1 vcpu: normally it is because the guest needs more
compute power. A guest with only 1 vcpu has just one because that is
enough; it probably needs only 0.1 vcpus anyway.
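Coming back to the weighting itself: just to make the "reverse the
importance of node load" idea concrete, below is a rough, self-contained
sketch of a comparison function that weights node load 3x over free
memory instead of the other way round. The struct and the sign() /
normalized_diff() helpers are only stand-ins guessed from the snippet
quoted above, not the real libxl code, so take it purely as an
illustration.

#include <math.h>

/* Illustrative stand-ins -- not the real libxl candidate type. */
struct candidate {
    double free_memkb;   /* free memory in the candidate node(s) */
    int nr_domains;      /* domains already placed there */
};

static int sign(double x)
{
    return x > 0.0 ? 1 : (x < 0.0 ? -1 : 0);
}

/* Difference of a and b, scaled to [-1, 1] so both metrics are comparable. */
static double normalized_diff(double a, double b)
{
    double m = fmax(fabs(a), fabs(b));

    return m > 0.0 ? (a - b) / m : 0.0;
}

/*
 * Comparator for sorting placement candidates: a negative result means
 * c1 is the better candidate.  Node load (nr_domains) is weighted 3x
 * over free memory, i.e. the reverse of the weighting quoted above.
 */
static int numa_cmpf(const struct candidate *c1, const struct candidate *c2)
{
    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);

    return sign(freememkb_diff + 3*nrdomains_diff);
}

Whether 3 is the right factor there would of course need the same kind
of testing Andre has been doing.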
A guest with many vcpus suffers much more from a heavily loaded node:
lock contention inside the guest increases with the number of vcpus, and
vcpus end up waiting for another vcpu that holds a lock but has been
preempted.


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
PDG ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions           e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                          Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
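PS: If counting vcpus instead of domains turns out to be the way to go,
the load metric fed into the comparison could be something like the
sketch below (again with made-up structures, and it deliberately ignores
the idle-vcpu problem Dario mentioned):

/* Made-up record: which node a domain was placed on, and its vcpu count. */
struct dom_placement {
    int node;
    int nr_vcpus;
};

/*
 * Total vcpus of all domains placed on the given node.  This would take
 * the place of nr_domains in the comparison; it still counts idle vcpus,
 * so it is only a rough proxy for real CPU pressure.
 */
static int node_nr_vcpus(const struct dom_placement *doms, int nr_doms,
                         int node)
{
    int vcpus = 0;

    for (int i = 0; i < nr_doms; i++)
        if (doms[i].node == node)
            vcpus += doms[i].nr_vcpus;

    return vcpus;
}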