Re: [for-4.9] Re: HVM guest performance regression

From: Juergen Gross <jgross@suse.com>
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Wei Liu <wei.liu2@citrix.com>
Subject: Re: [for-4.9] Re: HVM guest performance regression
Date: Thu, 8 Jun 2017 20:28:08 +0200
Message-ID: <e369ee76-1f91-fc08-dffd-f5d246c604d1@suse.com>
In-Reply-To: <alpine.DEB.2.10.1706081103510.26108@sstabellini-ThinkPad-X260>

On 08/06/17 20:09, Stefano Stabellini wrote:
> On Thu, 8 Jun 2017, Juergen Gross wrote:
>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>>>> On 06/06/17 18:39, Stefano Stabellini wrote:
>>>>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>>>>>> On 26/05/17 21:01, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 26 May 2017, Juergen Gross wrote:
>>>>>>>>>> On 26/05/17 18:19, Ian Jackson wrote:
>>>>>>>>>>> Juergen Gross writes ("HVM guest performance regression"):
>>>>>>>>>>>> Looking for the cause of a performance regression of HVM guests under
>>>>>>>>>>>> Xen 4.7 compared to 4.5, I found it to be commit
>>>>>>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>>>>>>>>>> in Xen 4.6.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>>>>>>>>>> the guest. The performance of some micro benchmarks dropped by about
>>>>>>>>>>>> a factor of 2 with above commit.
>>>>>>>>>>>>
>>>>>>>>>>>> Interesting point is that the performance of the guest will depend on
>>>>>>>>>>>> the amount of free memory being available at guest creation time.
>>>>>>>>>>>> When there was barely enough memory available for starting the guest
>>>>>>>>>>>> the performance will remain low even if memory is being freed later.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to suggest we either revert the commit or have some other
>>>>>>>>>>>> mechanism to try to have some reserve free memory when starting a
>>>>>>>>>>>> domain.
>>>>>>>>>>>
>>>>>>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>>>>>>>>>> going to drain that swamp now, but I don't like regressions.
>>>>>>>>>>>
>>>>>>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>>>>>>>>>> at the time; and according to the removal commit message, it was
>>>>>>>>>>> basically removed because it was a piece of cargo cult for which we
>>>>>>>>>>> had no justification in any of our records.
>>>>>>>>>>>
>>>>>>>>>>> Indeed I think fixing this is a candidate for 4.9.
>>>>>>>>>>>
>>>>>>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>>>>>>>>>> that would be a prerequisite for reverting this.  That way we can have
>>>>>>>>>>> an understanding of why we are doing things, rather than just
>>>>>>>>>>> flailing at random...
>>>>>>>>>>
>>>>>>>>>> I wish I understood it.
>>>>>>>>>>
>>>>>>>>>> One candidate would be 2M/1G pages being possible with enough free
>>>>>>>>>> memory, but I haven't proven this yet. I can give it a try by disabling
>>>>>>>>>> big pages in the hypervisor.
>>>>>>>>>
>>>>>>>>> Right, if I had to bet, I would put my money on superpages shattering
>>>>>>>>> being the cause of the problem.
>>>>>>>>
>>>>>>>> Seems you would have lost your money...
>>>>>>>>
>>>>>>>> Meanwhile I've found a way to get the "good" performance in the micro
>>>>>>>> benchmark. Unfortunately this requires switching off the pv interfaces
>>>>>>>> in the HVM guest via "xen_nopv" kernel boot parameter.
>>>>>>>>
>>>>>>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
>>>>>>>> kernel boot parameter). Switching to clocksource TSC in the running
>>>>>>>> system doesn't help either.
>>>>>>>
>>>>>>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
>>>>>>> xen_hvm_smp_init (PV IPI)?
>>>>>>
>>>>>> xen_hvm_exit_mmap isn't active (kernel message telling me so was
>>>>>> issued).
>>>>>>
>>>>>>>> Unfortunately the kernel seems no longer to be functional when I try to
>>>>>>>> tweak it not to use the PVHVM enhancements.
>>>>>>>
>>>>>>> I guess you are not talking about regular PV drivers like netfront and
>>>>>>> blkfront, right?
>>>>>>
>>>>>> The plan was to be able to use PV drivers without having to use PV
>>>>>> callbacks and PV timers. This isn't possible right now.
>>>>>
>>>>> I think the code to handle that scenario was gradually removed over time
>>>>> to simplify the code base.
>>>>
>>>> Hmm, too bad.
>>>>
>>>>>>>> I'm wondering now whether
>>>>>>>> there have ever been any benchmarks to prove that PVHVM really is faster
>>>>>>>> than non-PVHVM? My findings seem to suggest there might be a huge
>>>>>>>> performance gap with PVHVM. OTOH this might depend on hardware and other
>>>>>>>> factors.
>>>>>>>>
>>>>>>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
>>>>>>>> data from then regarding performance figures?
>>>>>>>
>>>>>>> Yes, I still have these slides:
>>>>>>>
>>>>>>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
>>>>>>
>>>>>> Thanks. So you measured the overall package, not the single items like
>>>>>> callbacks, timers, time source? I'm asking because I start to believe
>>>>>> there are some of those slower than their non-PV variants.
>>>>>
>>>>> There isn't much left in terms of individual optimizations: you already
>>>>> tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
>>>>> is not used. Only the following are left (you might want to double check
>>>>> I haven't missed anything):
>>>>>
>>>>> 1) PV IPI
>>>>
>>>> It's a 1 vcpu guest.
>>>>
>>>>> 2) PV suspend/resume
>>>>> 3) vector callback
>>>>> 4) interrupt remapping
>>>>>
>>>>> 2) is not on the hot path.
>>>>> I did individual measurements of 3) at some point and it was a clear win.
>>>>
>>>> That might depend on the hardware. Could it be newer processors are
>>>> faster here?
>>>
>>> I don't think so: the alternative is an emulated interrupt. It's
>>> slower from every point of view.
>>
>> What about APIC virtualization of modern processors? Are you sure e.g.
>> timer interrupts aren't handled completely by the processor? I guess
>> this might be faster than letting it be handled by the hypervisor and
>> then use the callback into the guest.
>>
>>> I would try to run the test with xen_emul_unplug="never" which means
>>> that you are going to end up using the emulated network card and
>>> emulated IDE controller, but some of the other optimizations (like the
>>> vector callback) will still be active.
>>
>> Now this is something I wouldn't like to do. My test isn't using any
>> I/O at all and is showing bad performance with pv interfaces being used.
>> The only remedy right now seems to be to switch off pv interfaces
>> leading to a bad I/O performance, but a good non-I/O performance.
>>
>> You are suggesting a mode with bad I/O performance _and_ bad non-I/O
>> performance.
> 
> I was only suggesting this for debugging, to better understand the
> problem, not as a solution.
> 
> 
>>> If the cause of the problem is ballooning for example, using emulated
>>> interfaces for IO will reduce the amount of ballooned out pages
>>> significantly.
>>
>> No I/O involved in my benchmark.
> 
> I admit that if your test doesn't do any I/O, it is not likely that
> xen_emul_unplug="never" will help us understand the problem.
> 
> Nonetheless, I believe that a simple blkfront/blkback or
> netfront/netback connection, even without any I/O being done, leads to a
> couple of calls into the ballooning code (xenbus_map_ring_valloc_hvm ->
> alloc_xenballooned_pages).

Only if the backend lives in an HVM domain. So in my case that's no problem,
as I have a classical PV dom0 hosting the backends.

Juergen
