xen-devel.lists.xenproject.org archive mirror
From: Juergen Gross <jgross@suse.com>
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Wei Liu <wei.liu2@citrix.com>
Subject: Re: [for-4.9] Re: HVM guest performance regression
Date: Mon, 29 May 2017 21:05:02 +0200	[thread overview]
Message-ID: <8be5f350-ad53-d74c-50fc-7ca71b6cdc3c@suse.com> (raw)
In-Reply-To: <alpine.DEB.2.10.1705261201010.18759@sstabellini-ThinkPad-X260>

On 26/05/17 21:01, Stefano Stabellini wrote:
> On Fri, 26 May 2017, Juergen Gross wrote:
>> On 26/05/17 18:19, Ian Jackson wrote:
>>> Juergen Gross writes ("HVM guest performance regression"):
>>>> While looking into a performance regression of HVM guests under
>>>> Xen 4.7 compared to 4.5 I found the cause to be commit
>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>> in Xen 4.6.
>>>>
>>>> The problem occurred when dom0 had to be ballooned down while starting
>>>> the guest. The performance of some micro benchmarks dropped by about
>>>> a factor of 2 with the above commit.
>>>>
>>>> The interesting point is that the performance of the guest depends on
>>>> the amount of free memory available at guest creation time.
>>>> When there was barely enough memory available to start the guest,
>>>> the performance remained low even if memory was freed later.
>>>>
>>>> I'd like to suggest we either revert the commit or add some other
>>>> mechanism to keep some reserve of free memory when starting a
>>>> domain.
>>>
>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>> going to drain that swamp now, but I don't like regressions.
>>>
>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>> at the time; and according to the removal commit message, it was
>>> basically removed because it was a piece of cargo cult for which we
>>> had no justification in any of our records.
>>>
>>> Indeed I think fixing this is a candidate for 4.9.
>>>
>>> Do you know the mechanism by which the freemem slack helps?  I think
>>> that would be a prerequisite for reverting this.  That way we can have
>>> an understanding of why we are doing things, rather than just
>>> flailing at random...
>>
>> I wish I understood it.
>>
>> One candidate would be 2MB/1GB pages being possible with enough free
>> memory, but I haven't proven this yet. I can try disabling
>> big pages in the hypervisor.
> 
> Right, if I had to bet, I would put my money on superpage shattering
> being the cause of the problem.

Creating the domains with

xl -vvv create ...

showed the number of superpages and normal pages allocated for the
domain.

The following allocation pattern resulted in a slow domain:

xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0000000000000600
xc: detail:   2MB PAGES: 0x00000000000003f9
xc: detail:   1GB PAGES: 0x0000000000000000

And this one was fast:

xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0000000000000400
xc: detail:   2MB PAGES: 0x00000000000003fa
xc: detail:   1GB PAGES: 0x0000000000000000

I ballooned dom0 down in small steps to be able to create those
test cases.
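
(Doing the arithmetic on those two reports: the slow case is
0x600 * 4kB + 0x3f9 * 2MB = 6MB + 2034MB = 2040MB, the fast case is
0x400 * 4kB + 0x3fa * 2MB = 4MB + 2036MB = 2040MB. So both domains
have the same total size; the slow one just has one more 2MB extent's
worth of memory backed by 4kB pages.)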

I believe the main reason is that some data needed by the benchmark
is located near the end of domain memory, resulting in a rather high
TLB miss rate when not all (or nearly all) of the memory is available
in the form of 2MB pages.
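
To put a rough number on it (the TLB sizes here are just assumptions
for illustration, not measured on the test machine): with, say, 1536
dTLB entries for 4kB translations the TLB reach is only
1536 * 4kB = 6MB, so a hot working set sitting in the 4kB-mapped tail
of the domain keeps missing, while the same 6MB mapped with 2MB pages
needs only 3 TLB entries.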

>> What makes the whole problem even more mysterious is that the
>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>> and Linux 4.4) against older systems (guest and dom0). While trying
>> to find out whether the guest or the Xen version is the culprit I
>> found that the old guest (based on kernel 3.12) showed the mentioned
>> performance drop with the above commit. The new guest (based on kernel
>> 4.4) shows the same bad performance regardless of the Xen version or
>> amount of free memory. I haven't yet found the Linux kernel commit
>> responsible for that performance drop.

And this might be the result of different memory usage in more recent
kernels: I suspect the critical data is now at the very end of the
domain's memory. As there are always some pages allocated in 4kB
chunks, the last pages of the domain will never be part of a 2MB page.

Looking at meminit_hvm() in libxc, which does the memory allocation,
I realized it is somewhat sub-optimal: shouldn't it try to allocate
the largest pages first and the smaller pages later?
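
Something like the following is what I have in mind -- just a rough
sketch, not the actual meminit_hvm() code (memory holes and error
handling are ignored; xc_domain_populate_physmap() is the real libxc
call, everything else is simplified):

#include <xenctrl.h>

#define ORDER_1GB 18                     /* 2^18 * 4kB = 1GB */
#define ORDER_2MB  9                     /* 2^9  * 4kB = 2MB */

/* Walk the pfn range and always try the largest extent order which
 * still fits and is suitably aligned, falling back to smaller orders
 * only when a larger allocation fails. */
static int populate_largest_first(xc_interface *xch, uint32_t domid,
                                  xen_pfn_t pfn, unsigned long todo)
{
    static const unsigned int orders[] = { ORDER_1GB, ORDER_2MB, 0 };
    unsigned int i;

    while ( todo )
    {
        for ( i = 0; i < 3; i++ )
        {
            unsigned long count = 1UL << orders[i];

            if ( todo < count || (pfn & (count - 1)) )
                continue;                /* doesn't fit or misaligned */

            if ( xc_domain_populate_physmap(xch, domid, 1, orders[i],
                                            0, &pfn) == 1 )
                break;                   /* got one extent of this order */
        }

        if ( i == 3 )
            return -1;                   /* even a 4kB page failed */

        pfn  += 1UL << orders[i];
        todo -= 1UL << orders[i];
    }

    return 0;
}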

Would it be possible to make memory holes larger sometimes to avoid
having to use 4kB pages (with the exception of the first 2MB of the
domain, of course)?

Maybe it would even make sense to be able to tweak the allocation
pattern depending on the guest type: preferring large pages either
at the top or at the bottom of the domain's physical address space.

And what should be done with the "freemem_slack" patch? Given my
findings I don't think we can define a fixed percentage of memory
which should be kept free. I could imagine some kind of mechanism
using dom0 ballooning more dynamically: as long as enough memory is
unused in dom0, balloon it down whenever an allocation of a large
page (1GB or 2MB) fails. After all memory for the new domain has been
allocated, balloon dom0 up again (but not beyond its size before
creation of the new domain started, of course).
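
Roughly along these lines -- purely illustrative, not an existing
libxl interface (libxl_set_memory_target() is real, but the helpers
and the hook into the domain build path are made up):

#include <libxl.h>

/* Shrink dom0 by just enough to satisfy one failed large-page
 * allocation while building a new domain; called from a
 * (hypothetical) retry path after a 2MB or 1GB extent allocation
 * failed. */
static int balloon_dom0_for_extent(libxl_ctx *ctx, uint64_t extent_kb)
{
    /* Relative, non-enforcing target change for dom0. */
    return libxl_set_memory_target(ctx, 0, -(int64_t)extent_kb, 1, 0);
}

/* Once all memory of the new domain has been allocated, let dom0
 * grow again -- but never beyond the target it had before domain
 * creation started. */
static void restore_dom0_target(libxl_ctx *ctx, uint64_t orig_target_kb)
{
    libxl_set_memory_target(ctx, 0, (int64_t)orig_target_kb, 0, 0);
}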

Thoughts?


Juergen


Thread overview: 27+ messages
2017-05-26 16:14 HVM guest performance regression Juergen Gross
2017-05-26 16:19 ` [for-4.9] " Ian Jackson
2017-05-26 17:00   ` Juergen Gross
2017-05-26 19:01     ` Stefano Stabellini
2017-05-29 19:05       ` Juergen Gross [this message]
2017-05-30  7:24         ` Jan Beulich
     [not found]         ` <592D3A3A020000780015D787@suse.com>
2017-05-30 10:33           ` Juergen Gross
2017-05-30 10:43             ` Jan Beulich
     [not found]             ` <592D68DC020000780015D919@suse.com>
2017-05-30 14:57               ` Juergen Gross
2017-05-30 15:10                 ` Jan Beulich
2017-06-06 13:44       ` Juergen Gross
2017-06-06 16:39         ` Stefano Stabellini
2017-06-06 19:00           ` Juergen Gross
2017-06-06 19:08             ` Stefano Stabellini
2017-06-07  6:55               ` Juergen Gross
2017-06-07 18:19                 ` Stefano Stabellini
2017-06-08  9:37                   ` Juergen Gross
2017-06-08 18:09                     ` Stefano Stabellini
2017-06-08 18:28                       ` Juergen Gross
2017-06-08 21:00                     ` Dario Faggioli
2017-06-11  2:27                       ` Konrad Rzeszutek Wilk
2017-06-12  5:48                       ` Solved: " Juergen Gross
2017-06-12  7:35                         ` Andrew Cooper
2017-06-12  7:47                           ` Juergen Gross
2017-06-12  8:30                             ` Andrew Cooper
2017-05-26 17:04 ` Dario Faggioli
2017-05-26 17:25   ` Juergen Gross
