From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic
 placement of guests on NUMA nodes
Date: Fri, 20 Jul 2012 10:38:00 +0200
Message-ID: <500918E8.3000708@ts.fujitsu.com>
References: <patchbomb.1342458791@Solace>	<5fa66c8b9093399e5bc3.1342458792@Solace>	<5007FBCE.6000201@amd.com>
	<1342707771.19530.235.camel@Solace>	<1342772429.19530.247.camel@Solace>
	<5009161C.2060005@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <5009161C.2060005@amd.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andre Przywara <andre.przywara@amd.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>, George Dunlap <george.dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel <xen-devel@lists.xen.org>, Dario Faggioli <raistlin@linux.it>
List-Id: xen-devel@lists.xenproject.org

Am 20.07.2012 10:26, schrieb Andre Przywara:
> On 07/20/2012 10:20 AM, Dario Faggioli wrote:
>> On Thu, 2012-07-19 at 16:22 +0200, Dario Faggioli wrote:
>>> Interesting. That's really the kind of testing we need in order to
>>> fine-tune the details. Thanks for doing this.
>>>
>>>> Then I started 32 guests, each 4 vCPUs and 1 GB of RAM.
>>>> Now since the code prefers free memory so much over free CPUs, the
>>>> placement was the following:
>>>> node0: guests 2,5,8,11,14,17,20,25,30
>>>> node1: guests 21,27
>>>> node2: none
>>>> node3: none
>>>> node4: guests 1,4,7,10,13,16,19,23,29
>>>> node5: guests 24,31
>>>> node6: guests 3,6,9,12,15,18,22,28
>>>> node7: guests 26,32
>>>>
>>>> As you can see, the nodes with more memory are _way_ overloaded, while
>>>> the lower memory ones are underutilized. In fact the first 20 guests
>>>> didn't use the other nodes at all.
>>>> I don't care so much about the two memory-less nodes, but I'd like to
>>>> know how you came to the magic "3" in the formula:
>>>>
>>>>> +
>>>>> + return sign(3*freememkb_diff + nrdomains_diff);
>>>>> +}
>>>>
>>>
>>> That all being said, this is the first time the patchset had the chance
>>> to run on such a big system, so I'm definitely open to suggestion on how
>>> to make that formula better in reflecting what we think it's The Right
>>> Thing!
>>>
>> Thinking more about this, I realize that I was implicitly assuming some
>> symmetry in the amount of memory each nodes comes with, which is
>> probably something I shouldn't have done...
>>
>> I really am not sure what to do here, perhaps treating the two metrics
>> more evenly? Or maybe even reverse the logic and give nr_domains more
>> weight?
>
> I replaced the 3 with 1 already, that didn't change so much. I think we
> should kind of reverse the importance of node load, since starving for
> CPU time is much worse than bad memory latency. I will do some
> experiments...
>
>> I was also thinking whether it could be worthwhile to consider the total
>> number of vcpus on a node instead than the number of domain, but again,
>> that's not guaranteed to be any more meaningful (suppose there are a lot
>> of idle vcpus)...
>
> Right, that was my thinking on the ride to work also ;-)
> What about this: 1P and 2P guests really use their vCPUs, but for bigger
> guests we assume only a fractional usage?

Hmm. I wouldn't be sure about this. I would guess there is a reason a guest has
more than 1 vcpu. Normally this is because the guest needs more power.

Having only 1 vcpu means this is enough. It probably would need only 0.1 vcpus.

A guest with many vcpus suffers much more from a heavy loaded node, as lock
contention in the guest is increasing with number of vcpus and waiting for a
vcpu holding a lock but being preempted.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PDG ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html