* HVM guest performance regression
@ 2017-05-26 16:14 Juergen Gross
  2017-05-26 16:19 ` [for-4.9] " Ian Jackson
  2017-05-26 17:04 ` Dario Faggioli
  0 siblings, 2 replies; 27+ messages in thread
From: Juergen Gross @ 2017-05-26 16:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu

While looking into a performance regression of HVM guests under
Xen 4.7 compared to 4.5 I found the cause to be commit
c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
in Xen 4.6.

The problem occurred when dom0 had to be ballooned down when starting
the guest. The performance of some micro benchmarks dropped by about
a factor of 2 with the above commit.

An interesting point is that the performance of the guest depends on
the amount of free memory available at guest creation time. When there
is barely enough memory available for starting the guest, the
performance remains low even if memory is freed later.

I'd like to suggest we either revert the commit or add some other
mechanism to keep some free memory in reserve when starting a domain.
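
For reference, a minimal sketch of what the reverted check did (from
memory, so treat the details as assumptions rather than a description
of the exact libxl code): guest creation effectively waited until

  free host memory >= memory needed by the new guest + freemem_slack

where freemem_slack was, if I remember correctly, initialized to
roughly 15% of the host's total memory and tracked in xenstore next to
dom0's memory targets. Reverting the commit would restore roughly that
behaviour.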


Juergen


* [for-4.9] Re: HVM guest performance regression
  2017-05-26 16:14 HVM guest performance regression Juergen Gross
@ 2017-05-26 16:19 ` Ian Jackson
  2017-05-26 17:00   ` Juergen Gross
  2017-05-26 17:04 ` Dario Faggioli
  1 sibling, 1 reply; 27+ messages in thread
From: Ian Jackson @ 2017-05-26 16:19 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Wei Liu

Juergen Gross writes ("HVM guest performance regression"):
> Looking for the reason of a performance regression of HVM guests under
> Xen 4.7 against 4.5 I found the reason to be commit
> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> in Xen 4.6.
> 
> The problem occurred when dom0 had to be ballooned down when starting
> the guest. The performance of some micro benchmarks dropped by about
> a factor of 2 with above commit.
> 
> Interesting point is that the performance of the guest will depend on
> the amount of free memory being available at guest creation time.
> When there was barely enough memory available for starting the guest
> the performance will remain low even if memory is being freed later.
> 
> I'd like to suggest we either revert the commit or have some other
> mechanism to try to have some reserve free memory when starting a
> domain.

Oh, dear.  The memory accounting swamp again.  Clearly we are not
going to drain that swamp now, but I don't like regressions.

I am not opposed to reverting that commit.  I was a bit iffy about it
at the time; and according to the removal commit message, it was
basically removed because it was a piece of cargo cult for which we
had no justification in any of our records.

Indeed I think fixing this is a candidate for 4.9.

Do you know the mechanism by which the freemem slack helps?  I think
that would be a prerequisite for reverting this.  That way we can have
an understanding of why we are doing things, rather than just
flailing at random...

Thanks,
Ian.


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-26 16:19 ` [for-4.9] " Ian Jackson
@ 2017-05-26 17:00   ` Juergen Gross
  2017-05-26 19:01     ` Stefano Stabellini
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-05-26 17:00 UTC (permalink / raw)
  To: Ian Jackson; +Cc: xen-devel, Stefano Stabellini, Wei Liu

On 26/05/17 18:19, Ian Jackson wrote:
> Juergen Gross writes ("HVM guest performance regression"):
>> Looking for the reason of a performance regression of HVM guests under
>> Xen 4.7 against 4.5 I found the reason to be commit
>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>> in Xen 4.6.
>>
>> The problem occurred when dom0 had to be ballooned down when starting
>> the guest. The performance of some micro benchmarks dropped by about
>> a factor of 2 with above commit.
>>
>> Interesting point is that the performance of the guest will depend on
>> the amount of free memory being available at guest creation time.
>> When there was barely enough memory available for starting the guest
>> the performance will remain low even if memory is being freed later.
>>
>> I'd like to suggest we either revert the commit or have some other
>> mechanism to try to have some reserve free memory when starting a
>> domain.
> 
> Oh, dear.  The memory accounting swamp again.  Clearly we are not
> going to drain that swamp now, but I don't like regressions.
> 
> I am not opposed to reverting that commit.  I was a bit iffy about it
> at the time; and according to the removal commit message, it was
> basically removed because it was a piece of cargo cult for which we
> had no justification in any of our records.
> 
> Indeed I think fixing this is a candidate for 4.9.
> 
> Do you know the mechanism by which the freemem slack helps ?  I think
> that would be a prerequisite for reverting this.  That way we can have
> an understanding of why we are doing things, rather than just
> flailing at random...

I wish I understood it.

One candidate would be 2MB/1GB pages being possible with enough free
memory, but I haven't proven this yet. I can try disabling big pages
in the hypervisor.
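
(A minimal sketch of how I'd do that, on the assumption that the host
uses HAP: Xen has the boolean command line options hap_1gb and hap_2mb,
so booting the hypervisor with

  hap_1gb=0 hap_2mb=0

should limit HVM guests to 4kB mappings. Worth double checking against
the command line documentation of the Xen version in use.)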

What makes the whole problem even more mysterious is that the
regression was first detected with SLE12 SP3 (guest and dom0, Xen 4.9
and Linux 4.4) against older systems (guest and dom0). While trying
to find out whether the guest or the Xen version is the culprit I
found that the old guest (based on kernel 3.12) showed the mentioned
performance drop with the above commit. The new guest (based on kernel
4.4) shows the same bad performance regardless of the Xen version or
the amount of free memory. I haven't yet found the Linux kernel commit
responsible for that performance drop.


Juergen


* Re: HVM guest performance regression
  2017-05-26 16:14 HVM guest performance regression Juergen Gross
  2017-05-26 16:19 ` [for-4.9] " Ian Jackson
@ 2017-05-26 17:04 ` Dario Faggioli
  2017-05-26 17:25   ` Juergen Gross
  1 sibling, 1 reply; 27+ messages in thread
From: Dario Faggioli @ 2017-05-26 17:04 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: Wei Liu, Stefano Stabellini, Ian Jackson


On Fri, 2017-05-26 at 18:14 +0200, Juergen Gross wrote:
> Looking for the reason of a performance regression of HVM guests
> under
> Xen 4.7 against 4.5 I found the reason to be commit
> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove
> freemem_slack")
> in Xen 4.6.
> 
> The problem occurred when dom0 had to be ballooned down when starting
> the guest. The performance of some micro benchmarks dropped by about
> a factor of 2 with above commit.
> 
Performance of micro benchmarks run _inside_ the guest, I'm guessing?

> Interesting point is that the performance of the guest will depend on
> the amount of free memory being available at guest creation time.
> When there was barely enough memory available for starting the guest
> the performance will remain low even if memory is being freed later.
> 
OOC, what kind of host? Big? Small? NUMA, non-NUMA?, etc

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: HVM guest performance regression
  2017-05-26 17:04 ` Dario Faggioli
@ 2017-05-26 17:25   ` Juergen Gross
  0 siblings, 0 replies; 27+ messages in thread
From: Juergen Gross @ 2017-05-26 17:25 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu

On 26/05/17 19:04, Dario Faggioli wrote:
> On Fri, 2017-05-26 at 18:14 +0200, Juergen Gross wrote:
>> Looking for the reason of a performance regression of HVM guests
>> under
>> Xen 4.7 against 4.5 I found the reason to be commit
>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove
>> freemem_slack")
>> in Xen 4.6.
>>
>> The problem occurred when dom0 had to be ballooned down when starting
>> the guest. The performance of some micro benchmarks dropped by about
>> a factor of 2 with above commit.
>>
> Performance of micro benchmarks run _inside_ the guest, I'm guessing?

Yep. libmicro benchmark "munmap".

>> Interesting point is that the performance of the guest will depend on
>> the amount of free memory being available at guest creation time.
>> When there was barely enough memory available for starting the guest
>> the performance will remain low even if memory is being freed later.
>>
> OOC, what kind of host? Big? Small? NUMA, non-NUMA?, etc

In my tests this happens _always_ on my laptop (dual-core Intel(R)
Core(TM) i7-4600M CPU @ 2.90GHz, 8GB memory, non-NUMA).

Guest size was 2GB, 1 vcpu.
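
For reference, a minimal xl domain configuration matching that setup
would look roughly like this (illustrative only, not the exact file I
used, and omitting the usual disk/network settings):

  builder = "hvm"
  name = "hvm-test"
  memory = 2048
  vcpus = 1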


Juergen


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-26 17:00   ` Juergen Gross
@ 2017-05-26 19:01     ` Stefano Stabellini
  2017-05-29 19:05       ` Juergen Gross
  2017-06-06 13:44       ` Juergen Gross
  0 siblings, 2 replies; 27+ messages in thread
From: Stefano Stabellini @ 2017-05-26 19:01 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Fri, 26 May 2017, Juergen Gross wrote:
> On 26/05/17 18:19, Ian Jackson wrote:
> > Juergen Gross writes ("HVM guest performance regression"):
> >> Looking for the reason of a performance regression of HVM guests under
> >> Xen 4.7 against 4.5 I found the reason to be commit
> >> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> >> in Xen 4.6.
> >>
> >> The problem occurred when dom0 had to be ballooned down when starting
> >> the guest. The performance of some micro benchmarks dropped by about
> >> a factor of 2 with above commit.
> >>
> >> Interesting point is that the performance of the guest will depend on
> >> the amount of free memory being available at guest creation time.
> >> When there was barely enough memory available for starting the guest
> >> the performance will remain low even if memory is being freed later.
> >>
> >> I'd like to suggest we either revert the commit or have some other
> >> mechanism to try to have some reserve free memory when starting a
> >> domain.
> > 
> > Oh, dear.  The memory accounting swamp again.  Clearly we are not
> > going to drain that swamp now, but I don't like regressions.
> > 
> > I am not opposed to reverting that commit.  I was a bit iffy about it
> > at the time; and according to the removal commit message, it was
> > basically removed because it was a piece of cargo cult for which we
> > had no justification in any of our records.
> > 
> > Indeed I think fixing this is a candidate for 4.9.
> > 
> > Do you know the mechanism by which the freemem slack helps ?  I think
> > that would be a prerequisite for reverting this.  That way we can have
> > an understanding of why we are doing things, rather than just
> > flailing at random...
> 
> I wish I would understand it.
> 
> One candidate would be 2M/1G pages being possible with enough free
> memory, but I haven't proofed this yet. I can have a try by disabling
> big pages in the hypervisor.

Right, if I had to bet, I would put my money on superpages shattering
being the cause of the problem.


> What makes the whole problem even more mysterious is that the
> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
> and Linux 4.4) against older systems (guest and dom0). While trying
> to find out whether the guest or the Xen version are the culprit I
> found that the old guest (based on kernel 3.12) showed the mentioned
> performance drop with above commit. The new guest (based on kernel
> 4.4) shows the same bad performance regardless of the Xen version or
> amount of free memory. I haven't found the Linux kernel commit yet
> being responsible for that performance drop.


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-26 19:01     ` Stefano Stabellini
@ 2017-05-29 19:05       ` Juergen Gross
  2017-05-30  7:24         ` Jan Beulich
       [not found]         ` <592D3A3A020000780015D787@suse.com>
  2017-06-06 13:44       ` Juergen Gross
  1 sibling, 2 replies; 27+ messages in thread
From: Juergen Gross @ 2017-05-29 19:05 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 26/05/17 21:01, Stefano Stabellini wrote:
> On Fri, 26 May 2017, Juergen Gross wrote:
>> On 26/05/17 18:19, Ian Jackson wrote:
>>> Juergen Gross writes ("HVM guest performance regression"):
>>>> Looking for the reason of a performance regression of HVM guests under
>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>> in Xen 4.6.
>>>>
>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>> the guest. The performance of some micro benchmarks dropped by about
>>>> a factor of 2 with above commit.
>>>>
>>>> Interesting point is that the performance of the guest will depend on
>>>> the amount of free memory being available at guest creation time.
>>>> When there was barely enough memory available for starting the guest
>>>> the performance will remain low even if memory is being freed later.
>>>>
>>>> I'd like to suggest we either revert the commit or have some other
>>>> mechanism to try to have some reserve free memory when starting a
>>>> domain.
>>>
>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>> going to drain that swamp now, but I don't like regressions.
>>>
>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>> at the time; and according to the removal commit message, it was
>>> basically removed because it was a piece of cargo cult for which we
>>> had no justification in any of our records.
>>>
>>> Indeed I think fixing this is a candidate for 4.9.
>>>
>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>> that would be a prerequisite for reverting this.  That way we can have
>>> an understanding of why we are doing things, rather than just
>>> flailing at random...
>>
>> I wish I would understand it.
>>
>> One candidate would be 2M/1G pages being possible with enough free
>> memory, but I haven't proofed this yet. I can have a try by disabling
>> big pages in the hypervisor.
> 
> Right, if I had to bet, I would put my money on superpages shattering
> being the cause of the problem.

Creating the domains with

xl -vvv create ...

showed the numbers of superpages and normal pages allocated for the
domain.

The following allocation pattern resulted in a slow domain:

xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0000000000000600
xc: detail:   2MB PAGES: 0x00000000000003f9
xc: detail:   1GB PAGES: 0x0000000000000000

And this one was fast:

xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0000000000000400
xc: detail:   2MB PAGES: 0x00000000000003fa
xc: detail:   1GB PAGES: 0x0000000000000000

I ballooned dom0 down in small steps to be able to create those
test cases.
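
To put numbers on it: the slow case is 0x600 = 1536 pages of 4kB
(6 MiB) plus 0x3f9 = 1017 pages of 2MB (2034 MiB), while the fast case
is 0x400 = 1024 pages of 4kB (4 MiB) plus 0x3fa = 1018 pages of 2MB
(2036 MiB). Both sum to 2040 MiB, so the two domains differ only in a
single 2MB extent being allocated either as one superpage or as 512
individual 4kB pages.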

I believe the main reason is that some data needed by the benchmark
is located near the end of domain memory, resulting in a rather high
TLB miss rate when not all (or nearly all) of the memory is available
in the form of 2MB pages.

>> What makes the whole problem even more mysterious is that the
>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>> and Linux 4.4) against older systems (guest and dom0). While trying
>> to find out whether the guest or the Xen version are the culprit I
>> found that the old guest (based on kernel 3.12) showed the mentioned
>> performance drop with above commit. The new guest (based on kernel
>> 4.4) shows the same bad performance regardless of the Xen version or
>> amount of free memory. I haven't found the Linux kernel commit yet
>> being responsible for that performance drop.

And this might be the result of different memory usage in more recent
kernels: I suspect the critical data is now at the very end of the
domain's memory. As there are always some pages allocated in 4kB
chunks, the last pages of the domain will never be part of a 2MB page.

Looking at meminit_hvm() in libxc, which does the memory allocation,
I realized it is kind of sub-optimal: shouldn't it try to allocate
the largest pages first and the smaller pages later?
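
As a rough illustration of what I mean (a hedged sketch only, not the
actual libxc logic, which also has to respect the memory holes;
try_populate() is a hypothetical stand-in for the
XENMEM_populate_physmap hypercall):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ORDER_2M  9
#define ORDER_1G 18

/* Hypothetical stand-in for a XENMEM_populate_physmap call: returns true
 * if an extent of 2^order pages was allocated at guest pfn gpfn. Here it
 * simply pretends no 1GB extents are available. */
static bool try_populate(uint64_t gpfn, unsigned int order)
{
    (void)gpfn;
    return order <= ORDER_2M;
}

/* Populate [gpfn, gpfn + nr_pages) preferring the largest page order. */
static void populate_range(uint64_t gpfn, uint64_t nr_pages)
{
    static const unsigned int orders[] = { ORDER_1G, ORDER_2M, 0 };
    unsigned int i;

    while (nr_pages) {
        for (i = 0; i < 3; i++) {
            uint64_t extent = 1ULL << orders[i];

            /* Use this order only if the remaining range is large enough,
             * the start is suitably aligned and the allocation succeeds. */
            if (nr_pages >= extent && !(gpfn & (extent - 1)) &&
                try_populate(gpfn, orders[i])) {
                gpfn += extent;
                nr_pages -= extent;
                break;
            }
        }
        if (i == 3) {
            fprintf(stderr, "allocation failed at pfn 0x%llx\n",
                    (unsigned long long)gpfn);
            return;
        }
    }
}

int main(void)
{
    populate_range(0, 0x80000);    /* 2GB worth of 4kB pages */
    return 0;
}

With such an ordering only genuinely unaligned or unallocatable ranges
end up as smaller pages, instead of smaller extents being used before
the first 1GB allocation is even attempted.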

Would it be possible to make memory holes larger sometimes to avoid
having to use 4kB pages (with the exception of the first 2MB of the
domain, of course)?

Maybe it would even make sense to be able to tweak the allocation
pattern depending on the guest type: preferring large pages either
at the top or at the bottom of the domain's physical address space.

And what should be done with the "freemem_slack" patch? Given my
findings I don't think we can define a fixed percentage of the memory
which should be kept free. I could imagine some kind of mechanism
using dom0 ballooning more dynamically: as long as enough memory is
unused in dom0, balloon it down whenever the allocation of a large
page (1GB or 2MB) fails. After all memory for the new domain has been
allocated, balloon dom0 up again (but not beyond its size from before
creation of the new domain started, of course).
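
Roughly, and building on the populate_range()/try_populate() sketch
above (balloon_dom0_down() is another hypothetical helper; a real
implementation might use something like libxl_set_memory_target() on
domain 0), the idea would be:

/* Hypothetical helper: shrink dom0 by 'pages' pages; returns false when
 * dom0 cannot be shrunk any further. Stubbed out for the sketch. */
static bool balloon_dom0_down(uint64_t pages)
{
    (void)pages;
    return false;
}

/* Hedged sketch only: before falling back to a smaller page order, try
 * to make room by shrinking dom0, and track how much was taken so dom0
 * can be ballooned back up once the new domain is fully populated. */
static bool try_populate_or_balloon(uint64_t gpfn, unsigned int order,
                                    uint64_t *pages_taken_from_dom0)
{
    while (!try_populate(gpfn, order)) {
        if (!balloon_dom0_down(1ULL << order))  /* dom0 can't give more */
            return false;                       /* fall back to smaller order */
        *pages_taken_from_dom0 += 1ULL << order;
    }
    return true;
}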

Thoughts?


Juergen


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-29 19:05       ` Juergen Gross
@ 2017-05-30  7:24         ` Jan Beulich
       [not found]         ` <592D3A3A020000780015D787@suse.com>
  1 sibling, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2017-05-30  7:24 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu, xen-devel

>>> On 29.05.17 at 21:05, <jgross@suse.com> wrote:
> Creating the domains with
> 
> xl -vvv create ...
> 
> showed the numbers of superpages and normal pages allocated for the
> domain.
> 
> The following allocation pattern resulted in a slow domain:
> 
> xc: detail: PHYSICAL MEMORY ALLOCATION:
> xc: detail:   4KB PAGES: 0x0000000000000600
> xc: detail:   2MB PAGES: 0x00000000000003f9
> xc: detail:   1GB PAGES: 0x0000000000000000
> 
> And this one was fast:
> 
> xc: detail: PHYSICAL MEMORY ALLOCATION:
> xc: detail:   4KB PAGES: 0x0000000000000400
> xc: detail:   2MB PAGES: 0x00000000000003fa
> xc: detail:   1GB PAGES: 0x0000000000000000
> 
> I ballooned dom0 down in small steps to be able to create those
> test cases.
> 
> I believe the main reason is that some data needed by the benchmark
> is located near the end of domain memory resulting in a rather high
> TLB miss rate in case of not all (or nearly all) memory available in
> form of 2MB pages.

Did you double check this by creating some other (persistent)
process prior to running your benchmark? I find it rather
unlikely that you would consistently see space from the top of
guest RAM allocated to your test, unless it consumes all RAM
that's available at the time it runs (but then I'd consider it
quite likely for overhead of using the few smaller pages to be
mostly hidden in the noise).

Or are you suspecting some crucial kernel structures to live
there?

>>> What makes the whole problem even more mysterious is that the
>>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>>> and Linux 4.4) against older systems (guest and dom0). While trying
>>> to find out whether the guest or the Xen version are the culprit I
>>> found that the old guest (based on kernel 3.12) showed the mentioned
>>> performance drop with above commit. The new guest (based on kernel
>>> 4.4) shows the same bad performance regardless of the Xen version or
>>> amount of free memory. I haven't found the Linux kernel commit yet
>>> being responsible for that performance drop.
> 
> And this might be result of a different memory usage of more recent
> kernels: I suspect the critical data is now at the very end of the
> domain's memory. As there are always some pages allocated in 4kB
> chunks the last pages of the domain will never be part of a 2MB page.

But if the OS allocated large pages internally for relevant data
structures, those obviously won't come from that necessarily 4k-
mapped tail range.

> Looking at meminit_hvm() in libxc doing the allocation of the memory
> I realized it is kind of sub-optimal: shouldn't it try to allocate
> the largest pages first and the smaller pages later?

Indeed this seems sub-optimal, yet the net effect isn't that
dramatic (at least for sufficiently large guests): There may be up
to two unnecessarily shattered 1G pages and at most one 2M
one afaict.

> Would it be possible to make memory holes larger sometimes to avoid
> having to use 4kB pages (with the exception of the first 2MB of the
> domain, of course)?

Which holes are you thinking about here? The pre-determined
one is at 0xF0000000 (i.e. is 2M-aligned already), and without
pass-through devices with large BARs hvmloader won't do any
relocation of RAM. Granted, when it does, doing so in larger
than 64k chunks may be advantageous. To have any effect,
that would require hypervisor side changes though, as
xenmem_add_to_physmap() acts on individual 4k pages right
now.

> Maybe it would even make sense to be able to tweak the allocation
> pattern depending on the guest type: preferring large pages either
> at the top or at the bottom of the domain's physical address space.

Why would top and bottom be better candidates for using large
pages than the middle part of address space? Any such heuristic
would surely need tailoring to the guest OS in order to not
adversely affect some while helping others.

Jan



* Re: [for-4.9] Re: HVM guest performance regression
       [not found]         ` <592D3A3A020000780015D787@suse.com>
@ 2017-05-30 10:33           ` Juergen Gross
  2017-05-30 10:43             ` Jan Beulich
       [not found]             ` <592D68DC020000780015D919@suse.com>
  0 siblings, 2 replies; 27+ messages in thread
From: Juergen Gross @ 2017-05-30 10:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu, xen-devel

On 30/05/17 09:24, Jan Beulich wrote:
>>>> On 29.05.17 at 21:05, <jgross@suse.com> wrote:
>> Creating the domains with
>>
>> xl -vvv create ...
>>
>> showed the numbers of superpages and normal pages allocated for the
>> domain.
>>
>> The following allocation pattern resulted in a slow domain:
>>
>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>> xc: detail:   4KB PAGES: 0x0000000000000600
>> xc: detail:   2MB PAGES: 0x00000000000003f9
>> xc: detail:   1GB PAGES: 0x0000000000000000
>>
>> And this one was fast:
>>
>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>> xc: detail:   4KB PAGES: 0x0000000000000400
>> xc: detail:   2MB PAGES: 0x00000000000003fa
>> xc: detail:   1GB PAGES: 0x0000000000000000
>>
>> I ballooned dom0 down in small steps to be able to create those
>> test cases.
>>
>> I believe the main reason is that some data needed by the benchmark
>> is located near the end of domain memory resulting in a rather high
>> TLB miss rate in case of not all (or nearly all) memory available in
>> form of 2MB pages.
> 
> Did you double check this by creating some other (persistent)
> process prior to running your benchmark? I find it rather
> unlikely that you would consistently see space from the top of
> guest RAM allocated to your test, unless it consumes all RAM
> that's available at the time it runs (but then I'd consider it
> quite likely for overhead of using the few smaller pages to be
> mostly hidden in the noise).
> 
> Or are you suspecting some crucial kernel structures to live
> there?

Yes, I do. When onlining memory at boot time the kernel uses the new
memory chunk to add the page structures and, if needed, new kernel page
tables. It normally allocates that memory from the end of the new
chunk.

> 
>>>> What makes the whole problem even more mysterious is that the
>>>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>>>> and Linux 4.4) against older systems (guest and dom0). While trying
>>>> to find out whether the guest or the Xen version are the culprit I
>>>> found that the old guest (based on kernel 3.12) showed the mentioned
>>>> performance drop with above commit. The new guest (based on kernel
>>>> 4.4) shows the same bad performance regardless of the Xen version or
>>>> amount of free memory. I haven't found the Linux kernel commit yet
>>>> being responsible for that performance drop.
>>
>> And this might be result of a different memory usage of more recent
>> kernels: I suspect the critical data is now at the very end of the
>> domain's memory. As there are always some pages allocated in 4kB
>> chunks the last pages of the domain will never be part of a 2MB page.
> 
> But if the OS allocated large pages internally for relevant data
> structures, those obviously won't come from that necessarily 4k-
> mapped tail range.

Sure? I think the kernel uses 1GB pages if possible for the direct
kernel mapping of physical memory. It doesn't care that the last
page maps some space which is not populated.

> 
>> Looking at meminit_hvm() in libxc doing the allocation of the memory
>> I realized it is kind of sub-optimal: shouldn't it try to allocate
>> the largest pages first and the smaller pages later?
> 
> Indeed this seems sub-optimal, yet the net effect isn't that
> dramatic (at least for sufficiently large guests): There may be up
> to two unnecessarily shattered 1G pages and at most one 2M
> one afaict.

Right. So there might be nearly 1 GB allocated using 2MB pages before
the first 1GB page is tried. This raises the probability of a failing
1GB allocation quite notably in case dom0 has been ballooned down for
guest creation.

>> Would it be possible to make memory holes larger sometimes to avoid
>> having to use 4kB pages (with the exception of the first 2MB of the
>> domain, of course)?
> 
> Which holes are you thinking about here? The pre-determined
> one is at 0xF0000000 (i.e. is 2M-aligned already), and without
> pass-through devices with large BARs hvmloader won't do any
> relocation of RAM. Granted, when it does, doing so in larger
> than 64k chunks may be advantageous. To have any effect,
> that would require hypervisor side changes though, as
> xenmem_add_to_physmap() acts on individual 4k pages right
> now.

Okay.

>> Maybe it would even make sense to be able to tweak the allocation
>> pattern depending on the guest type: preferring large pages either
>> at the top or at the bottom of the domain's physical address space.
> 
> Why would top and bottom be better candidates for using large
> pages than the middle part of address space? Any such heuristic
> would surely need tailoring to the guest OS in order to not
> adversely affect some while helping others.

Right. This would have to be a guest configuration item.


Juergen


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-30 10:33           ` Juergen Gross
@ 2017-05-30 10:43             ` Jan Beulich
       [not found]             ` <592D68DC020000780015D919@suse.com>
  1 sibling, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2017-05-30 10:43 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu, xen-devel

>>> On 30.05.17 at 12:33, <jgross@suse.com> wrote:
> On 30/05/17 09:24, Jan Beulich wrote:
>>>>> On 29.05.17 at 21:05, <jgross@suse.com> wrote:
>>> Creating the domains with
>>>
>>> xl -vvv create ...
>>>
>>> showed the numbers of superpages and normal pages allocated for the
>>> domain.
>>>
>>> The following allocation pattern resulted in a slow domain:
>>>
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000000600
>>> xc: detail:   2MB PAGES: 0x00000000000003f9
>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>
>>> And this one was fast:
>>>
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000000400
>>> xc: detail:   2MB PAGES: 0x00000000000003fa
>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>
>>> I ballooned dom0 down in small steps to be able to create those
>>> test cases.
>>>
>>> I believe the main reason is that some data needed by the benchmark
>>> is located near the end of domain memory resulting in a rather high
>>> TLB miss rate in case of not all (or nearly all) memory available in
>>> form of 2MB pages.
>> 
>> Did you double check this by creating some other (persistent)
>> process prior to running your benchmark? I find it rather
>> unlikely that you would consistently see space from the top of
>> guest RAM allocated to your test, unless it consumes all RAM
>> that's available at the time it runs (but then I'd consider it
>> quite likely for overhead of using the few smaller pages to be
>> mostly hidden in the noise).
>> 
>> Or are you suspecting some crucial kernel structures to live
>> there?
> 
> Yes, I do. When onlining memory at boot time the kernel is using the new
> memory chunk to add the page structures and if needed new kernel page
> tables. It is normally allocating that memory from the end of the new
> chunk.

The page tables are 4k allocations, sure. But the page structures
surely would be allocated with higher granularity?

>>>>> What makes the whole problem even more mysterious is that the
>>>>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>>>>> and Linux 4.4) against older systems (guest and dom0). While trying
>>>>> to find out whether the guest or the Xen version are the culprit I
>>>>> found that the old guest (based on kernel 3.12) showed the mentioned
>>>>> performance drop with above commit. The new guest (based on kernel
>>>>> 4.4) shows the same bad performance regardless of the Xen version or
>>>>> amount of free memory. I haven't found the Linux kernel commit yet
>>>>> being responsible for that performance drop.
>>>
>>> And this might be result of a different memory usage of more recent
>>> kernels: I suspect the critical data is now at the very end of the
>>> domain's memory. As there are always some pages allocated in 4kB
>>> chunks the last pages of the domain will never be part of a 2MB page.
>> 
>> But if the OS allocated large pages internally for relevant data
>> structures, those obviously won't come from that necessarily 4k-
>> mapped tail range.
> 
> Sure? I think the kernel is using 1GB pages if possible for direct
> kernel mappings of the physical memory. It doesn't care for the last
> page mapping some space not populated.

Are you sure? I would very much hope for Linux to not establish
mappings to addresses where no memory (and no MMIO) resides.
But I can't tell for sure for recent Linux versions; I do know in the
old days they were quite careful there.

Jan



* Re: [for-4.9] Re: HVM guest performance regression
       [not found]             ` <592D68DC020000780015D919@suse.com>
@ 2017-05-30 14:57               ` Juergen Gross
  2017-05-30 15:10                 ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-05-30 14:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu, xen-devel

On 30/05/17 12:43, Jan Beulich wrote:
>>>> On 30.05.17 at 12:33, <jgross@suse.com> wrote:
>> On 30/05/17 09:24, Jan Beulich wrote:
>>>>>> On 29.05.17 at 21:05, <jgross@suse.com> wrote:
>>>> Creating the domains with
>>>>
>>>> xl -vvv create ...
>>>>
>>>> showed the numbers of superpages and normal pages allocated for the
>>>> domain.
>>>>
>>>> The following allocation pattern resulted in a slow domain:
>>>>
>>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>>> xc: detail:   4KB PAGES: 0x0000000000000600
>>>> xc: detail:   2MB PAGES: 0x00000000000003f9
>>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>>
>>>> And this one was fast:
>>>>
>>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>>> xc: detail:   4KB PAGES: 0x0000000000000400
>>>> xc: detail:   2MB PAGES: 0x00000000000003fa
>>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>>
>>>> I ballooned dom0 down in small steps to be able to create those
>>>> test cases.
>>>>
>>>> I believe the main reason is that some data needed by the benchmark
>>>> is located near the end of domain memory resulting in a rather high
>>>> TLB miss rate in case of not all (or nearly all) memory available in
>>>> form of 2MB pages.
>>>
>>> Did you double check this by creating some other (persistent)
>>> process prior to running your benchmark? I find it rather
>>> unlikely that you would consistently see space from the top of
>>> guest RAM allocated to your test, unless it consumes all RAM
>>> that's available at the time it runs (but then I'd consider it
>>> quite likely for overhead of using the few smaller pages to be
>>> mostly hidden in the noise).
>>>
>>> Or are you suspecting some crucial kernel structures to live
>>> there?
>>
>> Yes, I do. When onlining memory at boot time the kernel is using the new
>> memory chunk to add the page structures and if needed new kernel page
>> tables. It is normally allocating that memory from the end of the new
>> chunk.
> 
> The page tables are 4k allocations, sure. But the page structures
> surely would be allocated with higher granularity?

I'm really not sure. It might depend on the memory model (sparse,
sparse vmemmap, flat).

>>>>>> What makes the whole problem even more mysterious is that the
>>>>>> regression was detected first with SLE12 SP3 (guest and dom0, Xen 4.9
>>>>>> and Linux 4.4) against older systems (guest and dom0). While trying
>>>>>> to find out whether the guest or the Xen version are the culprit I
>>>>>> found that the old guest (based on kernel 3.12) showed the mentioned
>>>>>> performance drop with above commit. The new guest (based on kernel
>>>>>> 4.4) shows the same bad performance regardless of the Xen version or
>>>>>> amount of free memory. I haven't found the Linux kernel commit yet
>>>>>> being responsible for that performance drop.
>>>>
>>>> And this might be result of a different memory usage of more recent
>>>> kernels: I suspect the critical data is now at the very end of the
>>>> domain's memory. As there are always some pages allocated in 4kB
>>>> chunks the last pages of the domain will never be part of a 2MB page.
>>>
>>> But if the OS allocated large pages internally for relevant data
>>> structures, those obviously won't come from that necessarily 4k-
>>> mapped tail range.
>>
>> Sure? I think the kernel is using 1GB pages if possible for direct
>> kernel mappings of the physical memory. It doesn't care for the last
>> page mapping some space not populated.
> 
> Are you sure? I would very much hope for Linux to not establish
> mappings to addresses where no memory (and no MMIO) resides.
> But I can't tell for sure for recent Linux versions; I do know in the
> old days they were quite careful there.

Looking at phys_pud_init() they are happily using 1GB pages until they
have all memory mapped.


Juergen


* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-30 14:57               ` Juergen Gross
@ 2017-05-30 15:10                 ` Jan Beulich
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2017-05-30 15:10 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ian Jackson, Stefano Stabellini, Wei Liu, xen-devel

>>> On 30.05.17 at 16:57, <jgross@suse.com> wrote:
> On 30/05/17 12:43, Jan Beulich wrote:
>>>>> On 30.05.17 at 12:33, <jgross@suse.com> wrote:
>>> On 30/05/17 09:24, Jan Beulich wrote:
>>>> But if the OS allocated large pages internally for relevant data
>>>> structures, those obviously won't come from that necessarily 4k-
>>>> mapped tail range.
>>>
>>> Sure? I think the kernel is using 1GB pages if possible for direct
>>> kernel mappings of the physical memory. It doesn't care for the last
>>> page mapping some space not populated.
>> 
>> Are you sure? I would very much hope for Linux to not establish
>> mappings to addresses where no memory (and no MMIO) resides.
>> But I can't tell for sure for recent Linux versions; I do know in the
>> old days they were quite careful there.
> 
> Looking at phys_pud_init() they are happily using 1GB pages until they
> have all memory mapped.

It's the layers higher up which I think make sure to call this with
bit PG_LEVEL_1G set only when covering all RAM (see
split_mem_range()).

Jan



* Re: [for-4.9] Re: HVM guest performance regression
  2017-05-26 19:01     ` Stefano Stabellini
  2017-05-29 19:05       ` Juergen Gross
@ 2017-06-06 13:44       ` Juergen Gross
  2017-06-06 16:39         ` Stefano Stabellini
  1 sibling, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-06-06 13:44 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 26/05/17 21:01, Stefano Stabellini wrote:
> On Fri, 26 May 2017, Juergen Gross wrote:
>> On 26/05/17 18:19, Ian Jackson wrote:
>>> Juergen Gross writes ("HVM guest performance regression"):
>>>> Looking for the reason of a performance regression of HVM guests under
>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>> in Xen 4.6.
>>>>
>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>> the guest. The performance of some micro benchmarks dropped by about
>>>> a factor of 2 with above commit.
>>>>
>>>> Interesting point is that the performance of the guest will depend on
>>>> the amount of free memory being available at guest creation time.
>>>> When there was barely enough memory available for starting the guest
>>>> the performance will remain low even if memory is being freed later.
>>>>
>>>> I'd like to suggest we either revert the commit or have some other
>>>> mechanism to try to have some reserve free memory when starting a
>>>> domain.
>>>
>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>> going to drain that swamp now, but I don't like regressions.
>>>
>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>> at the time; and according to the removal commit message, it was
>>> basically removed because it was a piece of cargo cult for which we
>>> had no justification in any of our records.
>>>
>>> Indeed I think fixing this is a candidate for 4.9.
>>>
>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>> that would be a prerequisite for reverting this.  That way we can have
>>> an understanding of why we are doing things, rather than just
>>> flailing at random...
>>
>> I wish I would understand it.
>>
>> One candidate would be 2M/1G pages being possible with enough free
>> memory, but I haven't proofed this yet. I can have a try by disabling
>> big pages in the hypervisor.
> 
> Right, if I had to bet, I would put my money on superpages shattering
> being the cause of the problem.

Seems you would have lost your money...

Meanwhile I've found a way to get the "good" performance in the micro
benchmark. Unfortunately this requires switching off the pv interfaces
in the HVM guest via the "xen_nopv" kernel boot parameter.

I have verified that pv spinlocks are not to blame (via the
"xen_nopvspin" kernel boot parameter). Switching to the TSC clocksource
in the running system doesn't help either.
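
(For completeness, the knobs involved: "xen_nopv" on the guest kernel
command line disables the Xen PV interfaces altogether, "xen_nopvspin"
disables only the PV spinlocks, and the clocksource can be switched at
runtime with

  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

assuming the guest kernel exposes the usual clocksource sysfs files.)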

Unfortunately the kernel no longer seems to be functional when I try to
tweak it not to use the PVHVM enhancements. I'm wondering now whether
there have ever been any benchmarks to prove that PVHVM really is faster
than non-PVHVM. My findings seem to suggest there might be a huge
performance gap with PVHVM. OTOH this might depend on hardware and other
factors.

Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
data from then regarding performance figures?


Juergen



* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-06 13:44       ` Juergen Gross
@ 2017-06-06 16:39         ` Stefano Stabellini
  2017-06-06 19:00           ` Juergen Gross
  0 siblings, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2017-06-06 16:39 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Tue, 6 Jun 2017, Juergen Gross wrote:
> On 26/05/17 21:01, Stefano Stabellini wrote:
> > On Fri, 26 May 2017, Juergen Gross wrote:
> >> On 26/05/17 18:19, Ian Jackson wrote:
> >>> Juergen Gross writes ("HVM guest performance regression"):
> >>>> Looking for the reason of a performance regression of HVM guests under
> >>>> Xen 4.7 against 4.5 I found the reason to be commit
> >>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> >>>> in Xen 4.6.
> >>>>
> >>>> The problem occurred when dom0 had to be ballooned down when starting
> >>>> the guest. The performance of some micro benchmarks dropped by about
> >>>> a factor of 2 with above commit.
> >>>>
> >>>> Interesting point is that the performance of the guest will depend on
> >>>> the amount of free memory being available at guest creation time.
> >>>> When there was barely enough memory available for starting the guest
> >>>> the performance will remain low even if memory is being freed later.
> >>>>
> >>>> I'd like to suggest we either revert the commit or have some other
> >>>> mechanism to try to have some reserve free memory when starting a
> >>>> domain.
> >>>
> >>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
> >>> going to drain that swamp now, but I don't like regressions.
> >>>
> >>> I am not opposed to reverting that commit.  I was a bit iffy about it
> >>> at the time; and according to the removal commit message, it was
> >>> basically removed because it was a piece of cargo cult for which we
> >>> had no justification in any of our records.
> >>>
> >>> Indeed I think fixing this is a candidate for 4.9.
> >>>
> >>> Do you know the mechanism by which the freemem slack helps ?  I think
> >>> that would be a prerequisite for reverting this.  That way we can have
> >>> an understanding of why we are doing things, rather than just
> >>> flailing at random...
> >>
> >> I wish I would understand it.
> >>
> >> One candidate would be 2M/1G pages being possible with enough free
> >> memory, but I haven't proofed this yet. I can have a try by disabling
> >> big pages in the hypervisor.
> > 
> > Right, if I had to bet, I would put my money on superpages shattering
> > being the cause of the problem.
> 
> Seems you would have lost your money...
> 
> Meanwhile I've found a way to get the "good" performance in the micro
> benchmark. Unfortunately this requires to switch off the pv interfaces
> in the HVM guest via "xen_nopv" kernel boot parameter.
> 
> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
> kernel boot parameter). Switching to clocksource TSC in the running
> system doesn't help either.

What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
xen_hvm_smp_init (PV IPI)?


> Unfortunately the kernel seems no longer to be functional when I try to
> tweak it not to use the PVHVM enhancements.

I guess you are not talking about regular PV drivers like netfront and
blkfront, right?


> I'm wondering now whether
> there have ever been any benchmarks to proof PVHVM really being faster
> than non-PVHVM? My findings seem to suggest there might be a huge
> performance gap with PVHVM. OTOH this might depend on hardware and other
> factors.
> 
> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
> data from then regarding performance figures?

Yes, I still have these slides:

https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm


* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-06 16:39         ` Stefano Stabellini
@ 2017-06-06 19:00           ` Juergen Gross
  2017-06-06 19:08             ` Stefano Stabellini
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-06-06 19:00 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 06/06/17 18:39, Stefano Stabellini wrote:
> On Tue, 6 Jun 2017, Juergen Gross wrote:
>> On 26/05/17 21:01, Stefano Stabellini wrote:
>>> On Fri, 26 May 2017, Juergen Gross wrote:
>>>> On 26/05/17 18:19, Ian Jackson wrote:
>>>>> Juergen Gross writes ("HVM guest performance regression"):
>>>>>> Looking for the reason of a performance regression of HVM guests under
>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>>>> in Xen 4.6.
>>>>>>
>>>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>>>> the guest. The performance of some micro benchmarks dropped by about
>>>>>> a factor of 2 with above commit.
>>>>>>
>>>>>> Interesting point is that the performance of the guest will depend on
>>>>>> the amount of free memory being available at guest creation time.
>>>>>> When there was barely enough memory available for starting the guest
>>>>>> the performance will remain low even if memory is being freed later.
>>>>>>
>>>>>> I'd like to suggest we either revert the commit or have some other
>>>>>> mechanism to try to have some reserve free memory when starting a
>>>>>> domain.
>>>>>
>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>>>> going to drain that swamp now, but I don't like regressions.
>>>>>
>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>>>> at the time; and according to the removal commit message, it was
>>>>> basically removed because it was a piece of cargo cult for which we
>>>>> had no justification in any of our records.
>>>>>
>>>>> Indeed I think fixing this is a candidate for 4.9.
>>>>>
>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>>>> that would be a prerequisite for reverting this.  That way we can have
>>>>> an understanding of why we are doing things, rather than just
>>>>> flailing at random...
>>>>
>>>> I wish I would understand it.
>>>>
>>>> One candidate would be 2M/1G pages being possible with enough free
>>>> memory, but I haven't proofed this yet. I can have a try by disabling
>>>> big pages in the hypervisor.
>>>
>>> Right, if I had to bet, I would put my money on superpages shattering
>>> being the cause of the problem.
>>
>> Seems you would have lost your money...
>>
>> Meanwhile I've found a way to get the "good" performance in the micro
>> benchmark. Unfortunately this requires to switch off the pv interfaces
>> in the HVM guest via "xen_nopv" kernel boot parameter.
>>
>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
>> kernel boot parameter). Switching to clocksource TSC in the running
>> system doesn't help either.
> 
> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
> xen_hvm_smp_init (PV IPI)?

xen_hvm_exit_mmap isn't active (a kernel message telling me so was
issued).

>> Unfortunately the kernel seems no longer to be functional when I try to
>> tweak it not to use the PVHVM enhancements.
> 
> I guess you are not talking about regular PV drivers like netfront and
> blkfront, right?

The plan was to be able to use PV drivers without having to use PV
callbacks and PV timers. This isn't possible right now.

>> I'm wondering now whether
>> there have ever been any benchmarks to proof PVHVM really being faster
>> than non-PVHVM? My findings seem to suggest there might be a huge
>> performance gap with PVHVM. OTOH this might depend on hardware and other
>> factors.
>>
>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
>> data from then regarding performance figures?
> 
> Yes, I still have these slides:
> 
> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm

Thanks. So you measured the overall package, not the individual items
like callbacks, timers and time source? I'm asking because I'm starting
to believe some of those are slower than their non-PV variants.


Juergen



* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-06 19:00           ` Juergen Gross
@ 2017-06-06 19:08             ` Stefano Stabellini
  2017-06-07  6:55               ` Juergen Gross
  0 siblings, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2017-06-06 19:08 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Tue, 6 Jun 2017, Juergen Gross wrote:
> On 06/06/17 18:39, Stefano Stabellini wrote:
> > On Tue, 6 Jun 2017, Juergen Gross wrote:
> >> On 26/05/17 21:01, Stefano Stabellini wrote:
> >>> On Fri, 26 May 2017, Juergen Gross wrote:
> >>>> On 26/05/17 18:19, Ian Jackson wrote:
> >>>>> Juergen Gross writes ("HVM guest performance regression"):
> >>>>>> Looking for the reason of a performance regression of HVM guests under
> >>>>>> Xen 4.7 against 4.5 I found the reason to be commit
> >>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> >>>>>> in Xen 4.6.
> >>>>>>
> >>>>>> The problem occurred when dom0 had to be ballooned down when starting
> >>>>>> the guest. The performance of some micro benchmarks dropped by about
> >>>>>> a factor of 2 with above commit.
> >>>>>>
> >>>>>> Interesting point is that the performance of the guest will depend on
> >>>>>> the amount of free memory being available at guest creation time.
> >>>>>> When there was barely enough memory available for starting the guest
> >>>>>> the performance will remain low even if memory is being freed later.
> >>>>>>
> >>>>>> I'd like to suggest we either revert the commit or have some other
> >>>>>> mechanism to try to have some reserve free memory when starting a
> >>>>>> domain.
> >>>>>
> >>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
> >>>>> going to drain that swamp now, but I don't like regressions.
> >>>>>
> >>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
> >>>>> at the time; and according to the removal commit message, it was
> >>>>> basically removed because it was a piece of cargo cult for which we
> >>>>> had no justification in any of our records.
> >>>>>
> >>>>> Indeed I think fixing this is a candidate for 4.9.
> >>>>>
> >>>>> Do you know the mechanism by which the freemem slack helps ?  I think
> >>>>> that would be a prerequisite for reverting this.  That way we can have
> >>>>> an understanding of why we are doing things, rather than just
> >>>>> flailing at random...
> >>>>
> >>>> I wish I would understand it.
> >>>>
> >>>> One candidate would be 2M/1G pages being possible with enough free
> >>>> memory, but I haven't proofed this yet. I can have a try by disabling
> >>>> big pages in the hypervisor.
> >>>
> >>> Right, if I had to bet, I would put my money on superpages shattering
> >>> being the cause of the problem.
> >>
> >> Seems you would have lost your money...
> >>
> >> Meanwhile I've found a way to get the "good" performance in the micro
> >> benchmark. Unfortunately this requires to switch off the pv interfaces
> >> in the HVM guest via "xen_nopv" kernel boot parameter.
> >>
> >> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
> >> kernel boot parameter). Switching to clocksource TSC in the running
> >> system doesn't help either.
> > 
> > What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
> > xen_hvm_smp_init (PV IPI)?
> 
> xen_hvm_exit_mmap isn't active (kernel message telling me so was
> issued).
> 
> >> Unfortunately the kernel seems no longer to be functional when I try to
> >> tweak it not to use the PVHVM enhancements.
> > 
> > I guess you are not talking about regular PV drivers like netfront and
> > blkfront, right?
> 
> The plan was to be able to use PV drivers without having to use PV
> callbacks and PV timers. This isn't possible right now.

I think the code to handle that scenario was gradually removed over time
to simplify the code base.


> >> I'm wondering now whether
> >> there have ever been any benchmarks to proof PVHVM really being faster
> >> than non-PVHVM? My findings seem to suggest there might be a huge
> >> performance gap with PVHVM. OTOH this might depend on hardware and other
> >> factors.
> >>
> >> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
> >> data from then regarding performance figures?
> > 
> > Yes, I still have these slides:
> > 
> > https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
> 
> Thanks. So you measured the overall package, not the single items like
> callbacks, timers, time source? I'm asking because I start to believe
> there are some of those slower than their non-PV variants.

There isn't much left in terms of individual optimizations: you already
tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
is not used. Only the following are left (you might want to double check
I haven't missed anything):

1) PV IPI
2) PV suspend/resume
3) vector callback
4) interrupt remapping

2) is not on the hot path.
I did individual measurements of 3) at some point and it was a clear win.
Slide 14 shows the individual measurements of 4)

Only 1) is left to check as far as I can tell.


* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-06 19:08             ` Stefano Stabellini
@ 2017-06-07  6:55               ` Juergen Gross
  2017-06-07 18:19                 ` Stefano Stabellini
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-06-07  6:55 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 06/06/17 21:08, Stefano Stabellini wrote:
> On Tue, 6 Jun 2017, Juergen Gross wrote:
>> On 06/06/17 18:39, Stefano Stabellini wrote:
>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>> On 26/05/17 21:01, Stefano Stabellini wrote:
>>>>> On Fri, 26 May 2017, Juergen Gross wrote:
>>>>>> On 26/05/17 18:19, Ian Jackson wrote:
>>>>>>> Juergen Gross writes ("HVM guest performance regression"):
>>>>>>>> Looking for the reason of a performance regression of HVM guests under
>>>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>>>>>> in Xen 4.6.
>>>>>>>>
>>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>>>>>> the guest. The performance of some micro benchmarks dropped by about
>>>>>>>> a factor of 2 with above commit.
>>>>>>>>
>>>>>>>> Interesting point is that the performance of the guest will depend on
>>>>>>>> the amount of free memory being available at guest creation time.
>>>>>>>> When there was barely enough memory available for starting the guest
>>>>>>>> the performance will remain low even if memory is being freed later.
>>>>>>>>
>>>>>>>> I'd like to suggest we either revert the commit or have some other
>>>>>>>> mechanism to try to have some reserve free memory when starting a
>>>>>>>> domain.
>>>>>>>
>>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>>>>>> going to drain that swamp now, but I don't like regressions.
>>>>>>>
>>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>>>>>> at the time; and according to the removal commit message, it was
>>>>>>> basically removed because it was a piece of cargo cult for which we
>>>>>>> had no justification in any of our records.
>>>>>>>
>>>>>>> Indeed I think fixing this is a candidate for 4.9.
>>>>>>>
>>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>>>>>> that would be a prerequisite for reverting this.  That way we can have
>>>>>>> an understanding of why we are doing things, rather than just
>>>>>>> flailing at random...
>>>>>>
>>>>>> I wish I would understand it.
>>>>>>
>>>>>> One candidate would be 2M/1G pages being possible with enough free
>>>>>> memory, but I haven't proofed this yet. I can have a try by disabling
>>>>>> big pages in the hypervisor.
>>>>>
>>>>> Right, if I had to bet, I would put my money on superpages shattering
>>>>> being the cause of the problem.
>>>>
>>>> Seems you would have lost your money...
>>>>
>>>> Meanwhile I've found a way to get the "good" performance in the micro
>>>> benchmark. Unfortunately this requires to switch off the pv interfaces
>>>> in the HVM guest via "xen_nopv" kernel boot parameter.
>>>>
>>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
>>>> kernel boot parameter). Switching to clocksource TSC in the running
>>>> system doesn't help either.
>>>
>>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
>>> xen_hvm_smp_init (PV IPI)?
>>
>> xen_hvm_exit_mmap isn't active (kernel message telling me so was
>> issued).
>>
>>>> Unfortunately the kernel seems no longer to be functional when I try to
>>>> tweak it not to use the PVHVM enhancements.
>>>
>>> I guess you are not talking about regular PV drivers like netfront and
>>> blkfront, right?
>>
>> The plan was to be able to use PV drivers without having to use PV
>> callbacks and PV timers. This isn't possible right now.
> 
> I think the code to handle that scenario was gradually removed over time
> to simplify the code base.

Hmm, too bad.

>>>> I'm wondering now whether
>>>> there have ever been any benchmarks to proof PVHVM really being faster
>>>> than non-PVHVM? My findings seem to suggest there might be a huge
>>>> performance gap with PVHVM. OTOH this might depend on hardware and other
>>>> factors.
>>>>
>>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
>>>> data from then regarding performance figures?
>>>
>>> Yes, I still have these slides:
>>>
>>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
>>
>> Thanks. So you measured the overall package, not the single items like
>> callbacks, timers, time source? I'm asking because I start to believe
>> there are some of those slower than their non-PV variants.
> 
> There isn't much left in terms of individual optimizations: you already
> tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
> is not used. Only the following are left (you might want to double check
> I haven't missed anything):
> 
> 1) PV IPI

It's a 1-vcpu guest.

> 2) PV suspend/resume
> 3) vector callback
> 4) interrupt remapping
> 
> 2) is not on the hot path.
> I did individual measurements of 3) at some points and it was a clear win.

That might depend on the hardware. Could it be that newer processors are
faster here?

> Slide 14 shows the individual measurements of 4)

I don't think this is affecting my benchmark. It is just munmap after
all.

> 
> Only 1) is left to check as far as I can tell.

No IPIs should be involved.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-07  6:55               ` Juergen Gross
@ 2017-06-07 18:19                 ` Stefano Stabellini
  2017-06-08  9:37                   ` Juergen Gross
  0 siblings, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2017-06-07 18:19 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Wed, 7 Jun 2017, Juergen Gross wrote:
> On 06/06/17 21:08, Stefano Stabellini wrote:
> > On Tue, 6 Jun 2017, Juergen Gross wrote:
> >> On 06/06/17 18:39, Stefano Stabellini wrote:
> >>> On Tue, 6 Jun 2017, Juergen Gross wrote:
> >>>> On 26/05/17 21:01, Stefano Stabellini wrote:
> >>>>> On Fri, 26 May 2017, Juergen Gross wrote:
> >>>>>> On 26/05/17 18:19, Ian Jackson wrote:
> >>>>>>> Juergen Gross writes ("HVM guest performance regression"):
> >>>>>>>> Looking for the reason of a performance regression of HVM guests under
> >>>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
> >>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> >>>>>>>> in Xen 4.6.
> >>>>>>>>
> >>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
> >>>>>>>> the guest. The performance of some micro benchmarks dropped by about
> >>>>>>>> a factor of 2 with above commit.
> >>>>>>>>
> >>>>>>>> Interesting point is that the performance of the guest will depend on
> >>>>>>>> the amount of free memory being available at guest creation time.
> >>>>>>>> When there was barely enough memory available for starting the guest
> >>>>>>>> the performance will remain low even if memory is being freed later.
> >>>>>>>>
> >>>>>>>> I'd like to suggest we either revert the commit or have some other
> >>>>>>>> mechanism to try to have some reserve free memory when starting a
> >>>>>>>> domain.
> >>>>>>>
> >>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
> >>>>>>> going to drain that swamp now, but I don't like regressions.
> >>>>>>>
> >>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
> >>>>>>> at the time; and according to the removal commit message, it was
> >>>>>>> basically removed because it was a piece of cargo cult for which we
> >>>>>>> had no justification in any of our records.
> >>>>>>>
> >>>>>>> Indeed I think fixing this is a candidate for 4.9.
> >>>>>>>
> >>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
> >>>>>>> that would be a prerequisite for reverting this.  That way we can have
> >>>>>>> an understanding of why we are doing things, rather than just
> >>>>>>> flailing at random...
> >>>>>>
> >>>>>> I wish I would understand it.
> >>>>>>
> >>>>>> One candidate would be 2M/1G pages being possible with enough free
> >>>>>> memory, but I haven't proofed this yet. I can have a try by disabling
> >>>>>> big pages in the hypervisor.
> >>>>>
> >>>>> Right, if I had to bet, I would put my money on superpages shattering
> >>>>> being the cause of the problem.
> >>>>
> >>>> Seems you would have lost your money...
> >>>>
> >>>> Meanwhile I've found a way to get the "good" performance in the micro
> >>>> benchmark. Unfortunately this requires to switch off the pv interfaces
> >>>> in the HVM guest via "xen_nopv" kernel boot parameter.
> >>>>
> >>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
> >>>> kernel boot parameter). Switching to clocksource TSC in the running
> >>>> system doesn't help either.
> >>>
> >>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
> >>> xen_hvm_smp_init (PV IPI)?
> >>
> >> xen_hvm_exit_mmap isn't active (kernel message telling me so was
> >> issued).
> >>
> >>>> Unfortunately the kernel seems no longer to be functional when I try to
> >>>> tweak it not to use the PVHVM enhancements.
> >>>
> >>> I guess you are not talking about regular PV drivers like netfront and
> >>> blkfront, right?
> >>
> >> The plan was to be able to use PV drivers without having to use PV
> >> callbacks and PV timers. This isn't possible right now.
> > 
> > I think the code to handle that scenario was gradually removed over time
> > to simplify the code base.
> 
> Hmm, too bad.
> 
> >>>> I'm wondering now whether
> >>>> there have ever been any benchmarks to proof PVHVM really being faster
> >>>> than non-PVHVM? My findings seem to suggest there might be a huge
> >>>> performance gap with PVHVM. OTOH this might depend on hardware and other
> >>>> factors.
> >>>>
> >>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
> >>>> data from then regarding performance figures?
> >>>
> >>> Yes, I still have these slides:
> >>>
> >>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
> >>
> >> Thanks. So you measured the overall package, not the single items like
> >> callbacks, timers, time source? I'm asking because I start to believe
> >> there are some of those slower than their non-PV variants.
> > 
> > There isn't much left in terms of individual optimizations: you already
> > tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
> > is not used. Only the following are left (you might want to double check
> > I haven't missed anything):
> > 
> > 1) PV IPI
> 
> Its a 1 vcpu guest.
> 
> > 2) PV suspend/resume
> > 3) vector callback
> > 4) interrupt remapping
> > 
> > 2) is not on the hot path.
> > I did individual measurements of 3) at some points and it was a clear win.
> 
> That might depend on the hardware. Could it be newer processors are
> faster here?

I don't think so: the alternative is an emulated interrupt, which is
slower in every respect.

I would try to run the test with xen_emul_unplug="never" which means
that you are going to end up using the emulated network card and
emulated IDE controller, but some of the other optimizations (like the
vector callback) will still be active.

If the cause of the problem is ballooning, for example, using emulated
interfaces for I/O will reduce the number of ballooned-out pages
significantly.


> > Slide 14 shows the individual measurements of 4)
> 
> I don't think this is affecting my benchmark. It is just munmap after
> all.

I agree.


> > Only 1) is left to check as far as I can tell.
> 
> No IPIs should be involved.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-07 18:19                 ` Stefano Stabellini
@ 2017-06-08  9:37                   ` Juergen Gross
  2017-06-08 18:09                     ` Stefano Stabellini
  2017-06-08 21:00                     ` Dario Faggioli
  0 siblings, 2 replies; 27+ messages in thread
From: Juergen Gross @ 2017-06-08  9:37 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 07/06/17 20:19, Stefano Stabellini wrote:
> On Wed, 7 Jun 2017, Juergen Gross wrote:
>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>> On 06/06/17 18:39, Stefano Stabellini wrote:
>>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>>>> On 26/05/17 21:01, Stefano Stabellini wrote:
>>>>>>> On Fri, 26 May 2017, Juergen Gross wrote:
>>>>>>>> On 26/05/17 18:19, Ian Jackson wrote:
>>>>>>>>> Juergen Gross writes ("HVM guest performance regression"):
>>>>>>>>>> Looking for the reason of a performance regression of HVM guests under
>>>>>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>>>>>>>> in Xen 4.6.
>>>>>>>>>>
>>>>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>>>>>>>> the guest. The performance of some micro benchmarks dropped by about
>>>>>>>>>> a factor of 2 with above commit.
>>>>>>>>>>
>>>>>>>>>> Interesting point is that the performance of the guest will depend on
>>>>>>>>>> the amount of free memory being available at guest creation time.
>>>>>>>>>> When there was barely enough memory available for starting the guest
>>>>>>>>>> the performance will remain low even if memory is being freed later.
>>>>>>>>>>
>>>>>>>>>> I'd like to suggest we either revert the commit or have some other
>>>>>>>>>> mechanism to try to have some reserve free memory when starting a
>>>>>>>>>> domain.
>>>>>>>>>
>>>>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>>>>>>>> going to drain that swamp now, but I don't like regressions.
>>>>>>>>>
>>>>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>>>>>>>> at the time; and according to the removal commit message, it was
>>>>>>>>> basically removed because it was a piece of cargo cult for which we
>>>>>>>>> had no justification in any of our records.
>>>>>>>>>
>>>>>>>>> Indeed I think fixing this is a candidate for 4.9.
>>>>>>>>>
>>>>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>>>>>>>> that would be a prerequisite for reverting this.  That way we can have
>>>>>>>>> an understanding of why we are doing things, rather than just
>>>>>>>>> flailing at random...
>>>>>>>>
>>>>>>>> I wish I would understand it.
>>>>>>>>
>>>>>>>> One candidate would be 2M/1G pages being possible with enough free
>>>>>>>> memory, but I haven't proofed this yet. I can have a try by disabling
>>>>>>>> big pages in the hypervisor.
>>>>>>>
>>>>>>> Right, if I had to bet, I would put my money on superpages shattering
>>>>>>> being the cause of the problem.
>>>>>>
>>>>>> Seems you would have lost your money...
>>>>>>
>>>>>> Meanwhile I've found a way to get the "good" performance in the micro
>>>>>> benchmark. Unfortunately this requires to switch off the pv interfaces
>>>>>> in the HVM guest via "xen_nopv" kernel boot parameter.
>>>>>>
>>>>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
>>>>>> kernel boot parameter). Switching to clocksource TSC in the running
>>>>>> system doesn't help either.
>>>>>
>>>>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
>>>>> xen_hvm_smp_init (PV IPI)?
>>>>
>>>> xen_hvm_exit_mmap isn't active (kernel message telling me so was
>>>> issued).
>>>>
>>>>>> Unfortunately the kernel seems no longer to be functional when I try to
>>>>>> tweak it not to use the PVHVM enhancements.
>>>>>
>>>>> I guess you are not talking about regular PV drivers like netfront and
>>>>> blkfront, right?
>>>>
>>>> The plan was to be able to use PV drivers without having to use PV
>>>> callbacks and PV timers. This isn't possible right now.
>>>
>>> I think the code to handle that scenario was gradually removed over time
>>> to simplify the code base.
>>
>> Hmm, too bad.
>>
>>>>>> I'm wondering now whether
>>>>>> there have ever been any benchmarks to proof PVHVM really being faster
>>>>>> than non-PVHVM? My findings seem to suggest there might be a huge
>>>>>> performance gap with PVHVM. OTOH this might depend on hardware and other
>>>>>> factors.
>>>>>>
>>>>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
>>>>>> data from then regarding performance figures?
>>>>>
>>>>> Yes, I still have these slides:
>>>>>
>>>>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
>>>>
>>>> Thanks. So you measured the overall package, not the single items like
>>>> callbacks, timers, time source? I'm asking because I start to believe
>>>> there are some of those slower than their non-PV variants.
>>>
>>> There isn't much left in terms of individual optimizations: you already
>>> tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
>>> is not used. Only the following are left (you might want to double check
>>> I haven't missed anything):
>>>
>>> 1) PV IPI
>>
>> Its a 1 vcpu guest.
>>
>>> 2) PV suspend/resume
>>> 3) vector callback
>>> 4) interrupt remapping
>>>
>>> 2) is not on the hot path.
>>> I did individual measurements of 3) at some points and it was a clear win.
>>
>> That might depend on the hardware. Could it be newer processors are
>> faster here?
> 
> I don't think so: the alternative it's an emulated interrupt. It's
> slower under all points of view.

What about the APIC virtualization of modern processors? Are you sure that
e.g. timer interrupts aren't handled completely by the processor? I guess
this might be faster than letting them be handled by the hypervisor and
then using the callback into the guest.

> I would try to run the test with xen_emul_unplug="never" which means
> that you are going to end up using the emulated network card and
> emulated IDE controller, but some of the other optimizations (like the
> vector callback) will still be active.

Now this is something I wouldn't like to do. My test isn't using any
I/O at all and is showing bad performance with the pv interfaces in use.
The only remedy right now seems to be to switch off the pv interfaces,
leading to bad I/O performance but good non-I/O performance.

You are suggesting a mode with bad I/O performance _and_ bad non-I/O
performance.

> If the cause of the problem is ballooning for example, using emulated
> interfaces for IO will reduce the amount of ballooned out pages
> significantly.

No I/O involved in my benchmark.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-08  9:37                   ` Juergen Gross
@ 2017-06-08 18:09                     ` Stefano Stabellini
  2017-06-08 18:28                       ` Juergen Gross
  2017-06-08 21:00                     ` Dario Faggioli
  1 sibling, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2017-06-08 18:09 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Thu, 8 Jun 2017, Juergen Gross wrote:
> On 07/06/17 20:19, Stefano Stabellini wrote:
> > On Wed, 7 Jun 2017, Juergen Gross wrote:
> >> On 06/06/17 21:08, Stefano Stabellini wrote:
> >>> On Tue, 6 Jun 2017, Juergen Gross wrote:
> >>>> On 06/06/17 18:39, Stefano Stabellini wrote:
> >>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
> >>>>>> On 26/05/17 21:01, Stefano Stabellini wrote:
> >>>>>>> On Fri, 26 May 2017, Juergen Gross wrote:
> >>>>>>>> On 26/05/17 18:19, Ian Jackson wrote:
> >>>>>>>>> Juergen Gross writes ("HVM guest performance regression"):
> >>>>>>>>>> Looking for the reason of a performance regression of HVM guests under
> >>>>>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
> >>>>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
> >>>>>>>>>> in Xen 4.6.
> >>>>>>>>>>
> >>>>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
> >>>>>>>>>> the guest. The performance of some micro benchmarks dropped by about
> >>>>>>>>>> a factor of 2 with above commit.
> >>>>>>>>>>
> >>>>>>>>>> Interesting point is that the performance of the guest will depend on
> >>>>>>>>>> the amount of free memory being available at guest creation time.
> >>>>>>>>>> When there was barely enough memory available for starting the guest
> >>>>>>>>>> the performance will remain low even if memory is being freed later.
> >>>>>>>>>>
> >>>>>>>>>> I'd like to suggest we either revert the commit or have some other
> >>>>>>>>>> mechanism to try to have some reserve free memory when starting a
> >>>>>>>>>> domain.
> >>>>>>>>>
> >>>>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
> >>>>>>>>> going to drain that swamp now, but I don't like regressions.
> >>>>>>>>>
> >>>>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
> >>>>>>>>> at the time; and according to the removal commit message, it was
> >>>>>>>>> basically removed because it was a piece of cargo cult for which we
> >>>>>>>>> had no justification in any of our records.
> >>>>>>>>>
> >>>>>>>>> Indeed I think fixing this is a candidate for 4.9.
> >>>>>>>>>
> >>>>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
> >>>>>>>>> that would be a prerequisite for reverting this.  That way we can have
> >>>>>>>>> an understanding of why we are doing things, rather than just
> >>>>>>>>> flailing at random...
> >>>>>>>>
> >>>>>>>> I wish I would understand it.
> >>>>>>>>
> >>>>>>>> One candidate would be 2M/1G pages being possible with enough free
> >>>>>>>> memory, but I haven't proofed this yet. I can have a try by disabling
> >>>>>>>> big pages in the hypervisor.
> >>>>>>>
> >>>>>>> Right, if I had to bet, I would put my money on superpages shattering
> >>>>>>> being the cause of the problem.
> >>>>>>
> >>>>>> Seems you would have lost your money...
> >>>>>>
> >>>>>> Meanwhile I've found a way to get the "good" performance in the micro
> >>>>>> benchmark. Unfortunately this requires to switch off the pv interfaces
> >>>>>> in the HVM guest via "xen_nopv" kernel boot parameter.
> >>>>>>
> >>>>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
> >>>>>> kernel boot parameter). Switching to clocksource TSC in the running
> >>>>>> system doesn't help either.
> >>>>>
> >>>>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
> >>>>> xen_hvm_smp_init (PV IPI)?
> >>>>
> >>>> xen_hvm_exit_mmap isn't active (kernel message telling me so was
> >>>> issued).
> >>>>
> >>>>>> Unfortunately the kernel seems no longer to be functional when I try to
> >>>>>> tweak it not to use the PVHVM enhancements.
> >>>>>
> >>>>> I guess you are not talking about regular PV drivers like netfront and
> >>>>> blkfront, right?
> >>>>
> >>>> The plan was to be able to use PV drivers without having to use PV
> >>>> callbacks and PV timers. This isn't possible right now.
> >>>
> >>> I think the code to handle that scenario was gradually removed over time
> >>> to simplify the code base.
> >>
> >> Hmm, too bad.
> >>
> >>>>>> I'm wondering now whether
> >>>>>> there have ever been any benchmarks to proof PVHVM really being faster
> >>>>>> than non-PVHVM? My findings seem to suggest there might be a huge
> >>>>>> performance gap with PVHVM. OTOH this might depend on hardware and other
> >>>>>> factors.
> >>>>>>
> >>>>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
> >>>>>> data from then regarding performance figures?
> >>>>>
> >>>>> Yes, I still have these slides:
> >>>>>
> >>>>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
> >>>>
> >>>> Thanks. So you measured the overall package, not the single items like
> >>>> callbacks, timers, time source? I'm asking because I start to believe
> >>>> there are some of those slower than their non-PV variants.
> >>>
> >>> There isn't much left in terms of individual optimizations: you already
> >>> tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
> >>> is not used. Only the following are left (you might want to double check
> >>> I haven't missed anything):
> >>>
> >>> 1) PV IPI
> >>
> >> Its a 1 vcpu guest.
> >>
> >>> 2) PV suspend/resume
> >>> 3) vector callback
> >>> 4) interrupt remapping
> >>>
> >>> 2) is not on the hot path.
> >>> I did individual measurements of 3) at some points and it was a clear win.
> >>
> >> That might depend on the hardware. Could it be newer processors are
> >> faster here?
> > 
> > I don't think so: the alternative it's an emulated interrupt. It's
> > slower under all points of view.
> 
> What about APIC virtualization of modern processors? Are you sure e.g.
> timer interrupts aren't handled completely by the processor? I guess
> this might be faster than letting it be handled by the hypervisor and
> then use the callback into the guest.
> 
> > I would try to run the test with xen_emul_unplug="never" which means
> > that you are going to end up using the emulated network card and
> > emulated IDE controller, but some of the other optimizations (like the
> > vector callback) will still be active.
> 
> Now this is something I wouldn't like to do. My test isn't using any
> I/O at all and is showing bad performance with pv interfaces being used.
> The only remedy right now seems to be to switch off pv interfaces
> leading to a bad I/O performance, but a good non-I/O performance.
> 
> You are suggesting a mode with bad I/O performance _and_ bad non-I/O
> performance.

I was only suggesting this for debugging, to better understand the
problem, not as a solution.


> > If the cause of the problem is ballooning for example, using emulated
> > interfaces for IO will reduce the amount of ballooned out pages
> > significantly.
> 
> No I/O involved in my benchmark.

I admit that if your test doesn't do any I/O, it is not likely that
xen_emul_unplug="never" will help us understand the problem.

Nonetheless, I believe that a simple blkfront/blkback or
netfront/netback connection, even without any I/O being done, leads to a
couple of calls into the ballooning code (xenbus_map_ring_valloc_hvm ->
alloc_xenballooned_pages).
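
As a side note, here is a minimal sketch (not the actual kernel code, and
with made-up helper names) of why that path ends up in the balloon driver:
the HVM frontend needs a free gfn in its physmap to host the foreign
mapping of each ring page, and it obtains one by ballooning a page out.
Prototypes as in the 4.x kernels discussed in this thread:

    #include <linux/mm.h>
    #include <xen/balloon.h>

    /* One ballooned-out page == one vacated gfn in the guest physmap,
     * which can then host the mapping of the backend's grant.  This is
     * what xenbus_map_ring_valloc_hvm() ends up doing per ring page. */
    static int reserve_gfn_for_ring(struct page **page)
    {
            return alloc_xenballooned_pages(1, page);
    }

    /* Undo it when the ring is torn down. */
    static void release_gfn_for_ring(struct page **page)
    {
            free_xenballooned_pages(1, page);
    }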

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-08 18:09                     ` Stefano Stabellini
@ 2017-06-08 18:28                       ` Juergen Gross
  0 siblings, 0 replies; 27+ messages in thread
From: Juergen Gross @ 2017-06-08 18:28 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 08/06/17 20:09, Stefano Stabellini wrote:
> On Thu, 8 Jun 2017, Juergen Gross wrote:
>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>>>> On 06/06/17 18:39, Stefano Stabellini wrote:
>>>>>>> On Tue, 6 Jun 2017, Juergen Gross wrote:
>>>>>>>> On 26/05/17 21:01, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 26 May 2017, Juergen Gross wrote:
>>>>>>>>>> On 26/05/17 18:19, Ian Jackson wrote:
>>>>>>>>>>> Juergen Gross writes ("HVM guest performance regression"):
>>>>>>>>>>>> Looking for the reason of a performance regression of HVM guests under
>>>>>>>>>>>> Xen 4.7 against 4.5 I found the reason to be commit
>>>>>>>>>>>> c26f92b8fce3c9df17f7ef035b54d97cbe931c7a ("libxl: remove freemem_slack")
>>>>>>>>>>>> in Xen 4.6.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem occurred when dom0 had to be ballooned down when starting
>>>>>>>>>>>> the guest. The performance of some micro benchmarks dropped by about
>>>>>>>>>>>> a factor of 2 with above commit.
>>>>>>>>>>>>
>>>>>>>>>>>> Interesting point is that the performance of the guest will depend on
>>>>>>>>>>>> the amount of free memory being available at guest creation time.
>>>>>>>>>>>> When there was barely enough memory available for starting the guest
>>>>>>>>>>>> the performance will remain low even if memory is being freed later.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to suggest we either revert the commit or have some other
>>>>>>>>>>>> mechanism to try to have some reserve free memory when starting a
>>>>>>>>>>>> domain.
>>>>>>>>>>>
>>>>>>>>>>> Oh, dear.  The memory accounting swamp again.  Clearly we are not
>>>>>>>>>>> going to drain that swamp now, but I don't like regressions.
>>>>>>>>>>>
>>>>>>>>>>> I am not opposed to reverting that commit.  I was a bit iffy about it
>>>>>>>>>>> at the time; and according to the removal commit message, it was
>>>>>>>>>>> basically removed because it was a piece of cargo cult for which we
>>>>>>>>>>> had no justification in any of our records.
>>>>>>>>>>>
>>>>>>>>>>> Indeed I think fixing this is a candidate for 4.9.
>>>>>>>>>>>
>>>>>>>>>>> Do you know the mechanism by which the freemem slack helps ?  I think
>>>>>>>>>>> that would be a prerequisite for reverting this.  That way we can have
>>>>>>>>>>> an understanding of why we are doing things, rather than just
>>>>>>>>>>> flailing at random...
>>>>>>>>>>
>>>>>>>>>> I wish I would understand it.
>>>>>>>>>>
>>>>>>>>>> One candidate would be 2M/1G pages being possible with enough free
>>>>>>>>>> memory, but I haven't proofed this yet. I can have a try by disabling
>>>>>>>>>> big pages in the hypervisor.
>>>>>>>>>
>>>>>>>>> Right, if I had to bet, I would put my money on superpages shattering
>>>>>>>>> being the cause of the problem.
>>>>>>>>
>>>>>>>> Seems you would have lost your money...
>>>>>>>>
>>>>>>>> Meanwhile I've found a way to get the "good" performance in the micro
>>>>>>>> benchmark. Unfortunately this requires to switch off the pv interfaces
>>>>>>>> in the HVM guest via "xen_nopv" kernel boot parameter.
>>>>>>>>
>>>>>>>> I have verified that pv spinlocks are not to blame (via "xen_nopvspin"
>>>>>>>> kernel boot parameter). Switching to clocksource TSC in the running
>>>>>>>> system doesn't help either.
>>>>>>>
>>>>>>> What about xen_hvm_exit_mmap (an optimization for shadow pagetables) and
>>>>>>> xen_hvm_smp_init (PV IPI)?
>>>>>>
>>>>>> xen_hvm_exit_mmap isn't active (kernel message telling me so was
>>>>>> issued).
>>>>>>
>>>>>>>> Unfortunately the kernel seems no longer to be functional when I try to
>>>>>>>> tweak it not to use the PVHVM enhancements.
>>>>>>>
>>>>>>> I guess you are not talking about regular PV drivers like netfront and
>>>>>>> blkfront, right?
>>>>>>
>>>>>> The plan was to be able to use PV drivers without having to use PV
>>>>>> callbacks and PV timers. This isn't possible right now.
>>>>>
>>>>> I think the code to handle that scenario was gradually removed over time
>>>>> to simplify the code base.
>>>>
>>>> Hmm, too bad.
>>>>
>>>>>>>> I'm wondering now whether
>>>>>>>> there have ever been any benchmarks to proof PVHVM really being faster
>>>>>>>> than non-PVHVM? My findings seem to suggest there might be a huge
>>>>>>>> performance gap with PVHVM. OTOH this might depend on hardware and other
>>>>>>>> factors.
>>>>>>>>
>>>>>>>> Stefano, didn't you do the PVHVM stuff back in 2010? Do you have any
>>>>>>>> data from then regarding performance figures?
>>>>>>>
>>>>>>> Yes, I still have these slides:
>>>>>>>
>>>>>>> https://www.slideshare.net/xen_com_mgr/linux-pv-on-hvm
>>>>>>
>>>>>> Thanks. So you measured the overall package, not the single items like
>>>>>> callbacks, timers, time source? I'm asking because I start to believe
>>>>>> there are some of those slower than their non-PV variants.
>>>>>
>>>>> There isn't much left in terms of individual optimizations: you already
>>>>> tried switching clocksource and removing pv spinlocks. xen_hvm_exit_mmap
>>>>> is not used. Only the following are left (you might want to double check
>>>>> I haven't missed anything):
>>>>>
>>>>> 1) PV IPI
>>>>
>>>> Its a 1 vcpu guest.
>>>>
>>>>> 2) PV suspend/resume
>>>>> 3) vector callback
>>>>> 4) interrupt remapping
>>>>>
>>>>> 2) is not on the hot path.
>>>>> I did individual measurements of 3) at some points and it was a clear win.
>>>>
>>>> That might depend on the hardware. Could it be newer processors are
>>>> faster here?
>>>
>>> I don't think so: the alternative it's an emulated interrupt. It's
>>> slower under all points of view.
>>
>> What about APIC virtualization of modern processors? Are you sure e.g.
>> timer interrupts aren't handled completely by the processor? I guess
>> this might be faster than letting it be handled by the hypervisor and
>> then use the callback into the guest.
>>
>>> I would try to run the test with xen_emul_unplug="never" which means
>>> that you are going to end up using the emulated network card and
>>> emulated IDE controller, but some of the other optimizations (like the
>>> vector callback) will still be active.
>>
>> Now this is something I wouldn't like to do. My test isn't using any
>> I/O at all and is showing bad performance with pv interfaces being used.
>> The only remedy right now seems to be to switch off pv interfaces
>> leading to a bad I/O performance, but a good non-I/O performance.
>>
>> You are suggesting a mode with bad I/O performance _and_ bad non-I/O
>> performance.
> 
> I was only suggesting this for debugging, to better understand the
> problem, not as a solution.
> 
> 
>>> If the cause of the problem is ballooning for example, using emulated
>>> interfaces for IO will reduce the amount of ballooned out pages
>>> significantly.
>>
>> No I/O involved in my benchmark.
> 
> I admit that if your test doesn't do any I/O, it is not likely that
> xen_emul_unplug="never" will help us understand the problem.
> 
> Nonetheless, I believe that a simple blkfront/blkback or
> netfront/netback connection, even without any I/O being done, leads to a
> couple of calls into the ballooning code (xenbus_map_ring_valloc_hvm ->
> alloc_xenballooned_pages).

Only if the backend lives in an HVM domain. So in my case there is no
problem, as I have a classical pv dom0 hosting the backends.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-08  9:37                   ` Juergen Gross
  2017-06-08 18:09                     ` Stefano Stabellini
@ 2017-06-08 21:00                     ` Dario Faggioli
  2017-06-11  2:27                       ` Konrad Rzeszutek Wilk
  2017-06-12  5:48                       ` Solved: " Juergen Gross
  1 sibling, 2 replies; 27+ messages in thread
From: Dario Faggioli @ 2017-06-08 21:00 UTC (permalink / raw)
  To: Juergen Gross, Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu


Bringing in Konrad because...

On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
> On 07/06/17 20:19, Stefano Stabellini wrote:
> > On Wed, 7 Jun 2017, Juergen Gross wrote:
> > > On 06/06/17 21:08, Stefano Stabellini wrote:
> > > > 
> > > > 2) PV suspend/resume
> > > > 3) vector callback
> > > > 4) interrupt remapping
> > > > 
> > > > 2) is not on the hot path.
> > > > I did individual measurements of 3) at some points and it was a
> > > > clear win.
> > > 
> > > That might depend on the hardware. Could it be newer processors
> > > are
> > > faster here?
> > 
> > I don't think so: the alternative it's an emulated interrupt. It's
> > slower under all points of view.
> 
> What about APIC virtualization of modern processors? Are you sure
> e.g.
> timer interrupts aren't handled completely by the processor? I guess
> this might be faster than letting it be handled by the hypervisor and
> then use the callback into the guest.
> 
... I kind of remember an email exchange we had, not here on the list,
but in private, about some apparently weird scheduling behavior you
were seeing, there at Oracle, on a particular benchmark/customer's
workload.

Not that this is directly related, but I seem to also recall that you
managed to find out that some of the perf difference (between baremetal
and guest) was due to vAPIC being faster than the PV path we were
taking? What I don't recall, though, is whether your guest was PV or
(PV)HVM... Do you remember anything more precisely than this?

It was like one or two years ago... (I'll dig in the archives for the
emails.)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [for-4.9] Re: HVM guest performance regression
  2017-06-08 21:00                     ` Dario Faggioli
@ 2017-06-11  2:27                       ` Konrad Rzeszutek Wilk
  2017-06-12  5:48                       ` Solved: " Juergen Gross
  1 sibling, 0 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-06-11  2:27 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, xen-devel, Stefano Stabellini, Ian Jackson, Wei Liu

On Thu, Jun 08, 2017 at 11:00:34PM +0200, Dario Faggioli wrote:
> Bringing in Konrad because...
> 
> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
> > On 07/06/17 20:19, Stefano Stabellini wrote:
> > > On Wed, 7 Jun 2017, Juergen Gross wrote:
> > > > On 06/06/17 21:08, Stefano Stabellini wrote:
> > > > > 
> > > > > 2) PV suspend/resume
> > > > > 3) vector callback
> > > > > 4) interrupt remapping
> > > > > 
> > > > > 2) is not on the hot path.
> > > > > I did individual measurements of 3) at some points and it was a
> > > > > clear win.
> > > > 
> > > > That might depend on the hardware. Could it be newer processors
> > > > are
> > > > faster here?
> > > 
> > > I don't think so: the alternative it's an emulated interrupt. It's
> > > slower under all points of view.
> > 
> > What about APIC virtualization of modern processors? Are you sure
> > e.g.
> > timer interrupts aren't handled completely by the processor? I guess
> > this might be faster than letting it be handled by the hypervisor and
> > then use the callback into the guest.
> > 
> ... I kind of remember an email exchange we had, not here on the list,
> but in private, about some apparently weird scheduling behavior you
> were seeing, there at Oracle, on a particular benchmark/customer's
> workload.
> 
> Not that this is directly related, but I seem to also recall that you
> managed to find out that some of the perf difference (between baremetal
> and guest) was due to vAPIC being faster than the PV path we were
> taking? What I don't recall, though, is whether your guest was PV or
> (PV)HVM... Do you remember anything more precisely than this?

It was HVM and it was a 2.6.39 kernel. And it was due to the Linux kernel
scheduler not scheduling applications back to back on the same CPU
but instead putting them on separate CPUs. The end result was that when
an application blocked, the kernel would call 'schedule()' and, instead
of the other application resuming, the kernel would go to sleep (until
a VIRQ TIMER).

The solution was quite simple - force the kernel to think that every
CPU was a sibling. And we fleshed this out more and provided an
'smt=1' option so that the CPU topology was exposed (along with
pinning).

> 
> It was like one or two years ago... (I'll dig in the archives for the
> emails.)
> 
> Regards,
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Solved: HVM guest performance regression
  2017-06-08 21:00                     ` Dario Faggioli
  2017-06-11  2:27                       ` Konrad Rzeszutek Wilk
@ 2017-06-12  5:48                       ` Juergen Gross
  2017-06-12  7:35                         ` Andrew Cooper
  1 sibling, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-06-12  5:48 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini; +Cc: xen-devel, Ian Jackson, Wei Liu

On 08/06/17 23:00, Dario Faggioli wrote:
> Bringing in Konrad because...
> 
> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>>
>>>>> 2) PV suspend/resume
>>>>> 3) vector callback
>>>>> 4) interrupt remapping
>>>>>
>>>>> 2) is not on the hot path.
>>>>> I did individual measurements of 3) at some points and it was a
>>>>> clear win.
>>>>
>>>> That might depend on the hardware. Could it be newer processors
>>>> are
>>>> faster here?
>>>
>>> I don't think so: the alternative it's an emulated interrupt. It's
>>> slower under all points of view.
>>
>> What about APIC virtualization of modern processors? Are you sure
>> e.g.
>> timer interrupts aren't handled completely by the processor? I guess
>> this might be faster than letting it be handled by the hypervisor and
>> then use the callback into the guest.
>>
> ... I kind of remember an email exchange we had, not here on the list,
> but in private, about some apparently weird scheduling behavior you
> were seeing, there at Oracle, on a particular benchmark/customer's
> workload.
> 
> Not that this is directly related, but I seem to also recall that you
> managed to find out that some of the perf difference (between baremetal
> and guest) was due to vAPIC being faster than the PV path we were
> taking? What I don't recall, though, is whether your guest was PV or
> (PV)HVM... Do you remember anything more precisely than this?

I now tweaked the kernel to use the LAPIC timer instead of the pv one.

While it is a tiny bit faster (<1%), this doesn't seem to be the
reason for the performance drop.

Using xentrace I've verified that no additional hypercalls or other
VMEXITs are occurring which would explain what is happening (I'm
seeing the timer being set and the related timer interrupt firing 250
times a second, which is expected).

Using ftrace in the kernel I can see all functions being called on
the munmap path. Nothing worrying and no weird differences between the
pv and the non-pv test.

What is interesting is that the time for the pv test isn't lost at one
or two specific points, but all over the test. All functions seem to run
just a little bit slower than in the non-pv case.

So I concluded it might be TLB related. The main difference between
using pv interfaces or not is the mapping of the shared info page into
the guest. The guest physical page for the shared info page is allocated
rather early via extend_brk(). Mapping the shared info page into the
guest requires that specific page to be mapped via a 4kB EPT entry,
which breaks up a 2MB entry. So at least most of the other data
allocated via extend_brk() in the kernel will be hit by this large-page
break-up. The other main data allocated this way are the early page
tables, which are essential for nearly all virtual addresses of the
kernel address space.
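
For context, a rough sketch of the existing allocation as described above
(simplified, with an invented helper name; the real code lives in
xen_hvm_init_shared_info()):

    #include <linux/init.h>
    #include <linux/pfn.h>
    #include <asm/page.h>
    #include <asm/setup.h>          /* extend_brk() */

    /* The brk-allocated page sits right among the kernel's early data,
     * so mapping the shared info page at this pfn forces the 2MB EPT
     * entry covering that region to be shattered. */
    static unsigned long __init alloc_shared_info_pfn_brk(void)
    {
            void *shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);

            return PFN_DOWN(__pa(shared_info));
    }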

Instead of using extend_brk() I tried allocating the shared info
pfn from the first MB of memory, as this area is already mapped via
4kB EPT entries. And indeed, this change did speed up the munmap test
even when using pv interfaces in the guest.
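
For illustration only (the proper patch will follow separately), a minimal
sketch of that alternative, assuming a free pfn below 1MB has already been
reserved and with an invented helper name; the mapping itself is the usual
XENMEM_add_to_physmap hypercall, just pointed at a low gfn:

    #include <linux/init.h>
    #include <linux/bug.h>
    #include <xen/interface/xen.h>      /* DOMID_SELF */
    #include <xen/interface/memory.h>   /* XENMEM_add_to_physmap */
    #include <asm/xen/hypercall.h>

    /* Map the shared info page at a gfn that is already covered by a
     * 4kB EPT entry (below 1MB), so no 2MB host mapping gets shattered. */
    static void __init xen_map_shared_info_low(unsigned long low_gpfn)
    {
            struct xen_add_to_physmap xatp = {
                    .domid = DOMID_SELF,
                    .space = XENMAPSPACE_shared_info,
                    .idx   = 0,
                    .gpfn  = low_gpfn,
            };

            if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
                    BUG();
    }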

I'll send a proper patch for the kernel after doing some more testing.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Solved: HVM guest performance regression
  2017-06-12  5:48                       ` Solved: " Juergen Gross
@ 2017-06-12  7:35                         ` Andrew Cooper
  2017-06-12  7:47                           ` Juergen Gross
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Cooper @ 2017-06-12  7:35 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli, Stefano Stabellini
  Cc: xen-devel, Ian Jackson, Wei Liu

On 12/06/2017 06:48, Juergen Gross wrote:
> On 08/06/17 23:00, Dario Faggioli wrote:
>> Bringing in Konrad because...
>>
>> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
>>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>>> 2) PV suspend/resume
>>>>>> 3) vector callback
>>>>>> 4) interrupt remapping
>>>>>>
>>>>>> 2) is not on the hot path.
>>>>>> I did individual measurements of 3) at some points and it was a
>>>>>> clear win.
>>>>> That might depend on the hardware. Could it be newer processors
>>>>> are
>>>>> faster here?
>>>> I don't think so: the alternative it's an emulated interrupt. It's
>>>> slower under all points of view.
>>> What about APIC virtualization of modern processors? Are you sure
>>> e.g.
>>> timer interrupts aren't handled completely by the processor? I guess
>>> this might be faster than letting it be handled by the hypervisor and
>>> then use the callback into the guest.
>>>
>> ... I kind of remember an email exchange we had, not here on the list,
>> but in private, about some apparently weird scheduling behavior you
>> were seeing, there at Oracle, on a particular benchmark/customer's
>> workload.
>>
>> Not that this is directly related, but I seem to also recall that you
>> managed to find out that some of the perf difference (between baremetal
>> and guest) was due to vAPIC being faster than the PV path we were
>> taking? What I don't recall, though, is whether your guest was PV or
>> (PV)HVM... Do you remember anything more precisely than this?
> I now tweaked the kernel to use the LAPIC timer instead of the pv one.
>
> While it is a very little bit faster (<1%) this doesn't seem to be the
> reason for the performance drop.
>
> Using xentrace I've verified that no additional hypercalls or other
> VMEXITs are occurring which would explain what is happening (I'm
> seeing setting the timer and the related timer interrupt 250 times
> a second, what is expected).
>
> Using ftrace in the kernel I can see all functions being called on
> the munmap path. Nothing worrying and no weird differences between the
> pv and the non-pv test.
>
> What is interesting is that the time for the pv test isn't lost at one
> or two specific points, but all over the test. All function seem to run
> just a little bit slower as in the non-pv case.
>
> So I concluded it might be TLB related. The main difference between
> using pv interfaces or not is the mapping of the shared info page into
> the guest. The guest physical page for the shared info page is allocated
> rather early via extend_brk(). Mapping the shared info page into the
> guest requires that specific page to be mapped via a 4kB EPT entry,
> resulting in breaking up a 2MB entry. So at least most of the other data
> allocated via extend_brk() in the kernel will be hit by this large page
> break up. The main other data allocated this way are the early page
> tables which are essential for nearly all virtual addresses of the
> kernel address space.
>
> Instead of using extend_brk() I had a try allocating the shared info
> pfn from the first MB of memory, as this area is already mapped via
> 4kB EPT entries. And indeed: this measure did speed up the munmap test
> even when using pv interfaces in the guest.
>
> I'll send a proper patch for the kernel after doing some more testing.

Is it practical to use somewhere other than the first MB of memory?

The only reason the first 2M of memory is mapped with 4k EPT entries is
because of MTRRs.  I'm still hoping we can sensibly disable them for PVH
workloads, after which, the guest could be mapped using exclusively 1G
EPT mappings (if such RAM/alignment were available in the system).

Ideally, all mapped-in frames (including grants, foreign frames, etc.)
would use GFNs above the top of RAM, so as never to shatter any of the
host superpage mappings of RAM.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Solved: HVM guest performance regression
  2017-06-12  7:35                         ` Andrew Cooper
@ 2017-06-12  7:47                           ` Juergen Gross
  2017-06-12  8:30                             ` Andrew Cooper
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-06-12  7:47 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli, Stefano Stabellini
  Cc: xen-devel, Ian Jackson, Wei Liu

On 12/06/17 09:35, Andrew Cooper wrote:
> On 12/06/2017 06:48, Juergen Gross wrote:
>> On 08/06/17 23:00, Dario Faggioli wrote:
>>> Bringing in Konrad because...
>>>
>>> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
>>>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>>>> 2) PV suspend/resume
>>>>>>> 3) vector callback
>>>>>>> 4) interrupt remapping
>>>>>>>
>>>>>>> 2) is not on the hot path.
>>>>>>> I did individual measurements of 3) at some points and it was a
>>>>>>> clear win.
>>>>>> That might depend on the hardware. Could it be newer processors
>>>>>> are
>>>>>> faster here?
>>>>> I don't think so: the alternative it's an emulated interrupt. It's
>>>>> slower under all points of view.
>>>> What about APIC virtualization of modern processors? Are you sure
>>>> e.g.
>>>> timer interrupts aren't handled completely by the processor? I guess
>>>> this might be faster than letting it be handled by the hypervisor and
>>>> then use the callback into the guest.
>>>>
>>> ... I kind of remember an email exchange we had, not here on the list,
>>> but in private, about some apparently weird scheduling behavior you
>>> were seeing, there at Oracle, on a particular benchmark/customer's
>>> workload.
>>>
>>> Not that this is directly related, but I seem to also recall that you
>>> managed to find out that some of the perf difference (between baremetal
>>> and guest) was due to vAPIC being faster than the PV path we were
>>> taking? What I don't recall, though, is whether your guest was PV or
>>> (PV)HVM... Do you remember anything more precisely than this?
>> I now tweaked the kernel to use the LAPIC timer instead of the pv one.
>>
>> While it is a very little bit faster (<1%) this doesn't seem to be the
>> reason for the performance drop.
>>
>> Using xentrace I've verified that no additional hypercalls or other
>> VMEXITs are occurring which would explain what is happening (I'm
>> seeing setting the timer and the related timer interrupt 250 times
>> a second, what is expected).
>>
>> Using ftrace in the kernel I can see all functions being called on
>> the munmap path. Nothing worrying and no weird differences between the
>> pv and the non-pv test.
>>
>> What is interesting is that the time for the pv test isn't lost at one
>> or two specific points, but all over the test. All function seem to run
>> just a little bit slower as in the non-pv case.
>>
>> So I concluded it might be TLB related. The main difference between
>> using pv interfaces or not is the mapping of the shared info page into
>> the guest. The guest physical page for the shared info page is allocated
>> rather early via extend_brk(). Mapping the shared info page into the
>> guest requires that specific page to be mapped via a 4kB EPT entry,
>> resulting in breaking up a 2MB entry. So at least most of the other data
>> allocated via extend_brk() in the kernel will be hit by this large page
>> break up. The main other data allocated this way are the early page
>> tables which are essential for nearly all virtual addresses of the
>> kernel address space.
>>
>> Instead of using extend_brk() I had a try allocating the shared info
>> pfn from the first MB of memory, as this area is already mapped via
>> 4kB EPT entries. And indeed: this measure did speed up the munmap test
>> even when using pv interfaces in the guest.
>>
>> I'll send a proper patch for the kernel after doing some more testing.
> 
> Is it practical to use somewhere other than the first MB of memory?
> 
> The only reason the first 2M of memory is mapped with 4k EPT entries is
> because of MTRRs.  I'm still hoping we can sensibly disable them for PVH
> workloads, after which, the guest could be mapped using exclusively 1G
> EPT mappings (if such RAM/alignment were available in the system).
>
> Ideally, all mapped-in frames (including grants, foreign frames, etc)
> would use GFNs above the top of RAM, so as never to shatter any of the
> host superpage mappings RAM.

Right. We can easily move to such a region (e.g. Xen PCI-device memory)
when we've removed the MTRR settings for the low memory. Right now using
the low 1MB of memory is working well and requires only very limited
changes, thus making a backport much easier.

BTW: I could imagine that using a special GFN region for all specially
mapped data might require some hypervisor tweaks, too.


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Solved: HVM guest performance regression
  2017-06-12  7:47                           ` Juergen Gross
@ 2017-06-12  8:30                             ` Andrew Cooper
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Cooper @ 2017-06-12  8:30 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli, Stefano Stabellini
  Cc: xen-devel, Ian Jackson, Wei Liu

On 12/06/2017 08:47, Juergen Gross wrote:
> On 12/06/17 09:35, Andrew Cooper wrote:
>> On 12/06/2017 06:48, Juergen Gross wrote:
>>> On 08/06/17 23:00, Dario Faggioli wrote:
>>>> Bringing in Konrad because...
>>>>
>>>> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
>>>>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>>>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>>>>> 2) PV suspend/resume
>>>>>>>> 3) vector callback
>>>>>>>> 4) interrupt remapping
>>>>>>>>
>>>>>>>> 2) is not on the hot path.
>>>>>>>> I did individual measurements of 3) at some points and it was a
>>>>>>>> clear win.
>>>>>>> That might depend on the hardware. Could it be newer processors
>>>>>>> are
>>>>>>> faster here?
>>>>>> I don't think so: the alternative it's an emulated interrupt. It's
>>>>>> slower under all points of view.
>>>>> What about APIC virtualization of modern processors? Are you sure
>>>>> e.g.
>>>>> timer interrupts aren't handled completely by the processor? I guess
>>>>> this might be faster than letting it be handled by the hypervisor and
>>>>> then use the callback into the guest.
>>>>>
>>>> ... I kind of remember an email exchange we had, not here on the list,
>>>> but in private, about some apparently weird scheduling behavior you
>>>> were seeing, there at Oracle, on a particular benchmark/customer's
>>>> workload.
>>>>
>>>> Not that this is directly related, but I seem to also recall that you
>>>> managed to find out that some of the perf difference (between baremetal
>>>> and guest) was due to vAPIC being faster than the PV path we were
>>>> taking? What I don't recall, though, is whether your guest was PV or
>>>> (PV)HVM... Do you remember anything more precisely than this?
>>> I now tweaked the kernel to use the LAPIC timer instead of the pv one.
>>>
>>> While it is a very little bit faster (<1%) this doesn't seem to be the
>>> reason for the performance drop.
>>>
>>> Using xentrace I've verified that no additional hypercalls or other
>>> VMEXITs are occurring which would explain what is happening (I'm
>>> seeing setting the timer and the related timer interrupt 250 times
>>> a second, what is expected).
>>>
>>> Using ftrace in the kernel I can see all functions being called on
>>> the munmap path. Nothing worrying and no weird differences between the
>>> pv and the non-pv test.
>>>
>>> What is interesting is that the time for the pv test isn't lost at one
>>> or two specific points, but all over the test. All function seem to run
>>> just a little bit slower as in the non-pv case.
>>>
>>> So I concluded it might be TLB related. The main difference between
>>> using pv interfaces or not is the mapping of the shared info page into
>>> the guest. The guest physical page for the shared info page is allocated
>>> rather early via extend_brk(). Mapping the shared info page into the
>>> guest requires that specific page to be mapped via a 4kB EPT entry,
>>> resulting in breaking up a 2MB entry. So at least most of the other data
>>> allocated via extend_brk() in the kernel will be hit by this large page
>>> break up. The main other data allocated this way are the early page
>>> tables which are essential for nearly all virtual addresses of the
>>> kernel address space.
>>>
>>> Instead of using extend_brk() I tried allocating the shared info
>>> pfn from the first MB of memory, as this area is already mapped via
>>> 4kB EPT entries. And indeed: this change did speed up the munmap test
>>> even when using pv interfaces in the guest.
>>>
>>> I'll send a proper patch for the kernel after doing some more testing.
>> Is it practical to use somewhere other than the first MB of memory?
>>
>> The only reason the first 2M of memory is mapped with 4k EPT entries is
>> because of MTRRs.  I'm still hoping we can sensibly disable them for PVH
>> workloads, after which the guest could be mapped using exclusively 1G
>> EPT mappings (if such RAM/alignment were available in the system).
>>
>> Ideally, all mapped-in frames (including grants, foreign frames, etc.)
>> would use GFNs above the top of RAM, so as never to shatter any of the
>> host superpage mappings of RAM.
> Right. We can easily move to such a region (e.g. Xen PCI-device memory)
> when we've removed the MTRR settings for the low memory. Right now using
> the low 1MB of memory is working well and requires only very limited
> changes, thus making a backport much easier.

Good point.

>
> BTW: I could imagine that using a special GFN region for all specially
> mapped data might require some hypervisor tweaks, too.

None that I'm aware of.  One of the many things Xen should
currently do (and doesn't) is limit the guest's choice of gfns to a
range pre-determined by the toolstack.
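
To illustrate the point: guest-side code placing the shared info page
today simply picks a gpfn and hands it straight to the hypervisor, and
nothing limits that choice to a toolstack-approved range.  A rough sketch
(the helper name map_shared_info_at() is made up, error handling trimmed):

#include <linux/bug.h>
#include <xen/interface/xen.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static void map_shared_info_at(unsigned long gpfn)
{
        struct xen_add_to_physmap xatp = {
                .domid = DOMID_SELF,
                .space = XENMAPSPACE_shared_info,
                .idx   = 0,
                /* Nothing restricts this to a toolstack-chosen range today. */
                .gpfn  = gpfn,
        };

        if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
                BUG();
}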

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-06-12  8:30 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-26 16:14 HVM guest performance regression Juergen Gross
2017-05-26 16:19 ` [for-4.9] " Ian Jackson
2017-05-26 17:00   ` Juergen Gross
2017-05-26 19:01     ` Stefano Stabellini
2017-05-29 19:05       ` Juergen Gross
2017-05-30  7:24         ` Jan Beulich
     [not found]         ` <592D3A3A020000780015D787@suse.com>
2017-05-30 10:33           ` Juergen Gross
2017-05-30 10:43             ` Jan Beulich
     [not found]             ` <592D68DC020000780015D919@suse.com>
2017-05-30 14:57               ` Juergen Gross
2017-05-30 15:10                 ` Jan Beulich
2017-06-06 13:44       ` Juergen Gross
2017-06-06 16:39         ` Stefano Stabellini
2017-06-06 19:00           ` Juergen Gross
2017-06-06 19:08             ` Stefano Stabellini
2017-06-07  6:55               ` Juergen Gross
2017-06-07 18:19                 ` Stefano Stabellini
2017-06-08  9:37                   ` Juergen Gross
2017-06-08 18:09                     ` Stefano Stabellini
2017-06-08 18:28                       ` Juergen Gross
2017-06-08 21:00                     ` Dario Faggioli
2017-06-11  2:27                       ` Konrad Rzeszutek Wilk
2017-06-12  5:48                       ` Solved: " Juergen Gross
2017-06-12  7:35                         ` Andrew Cooper
2017-06-12  7:47                           ` Juergen Gross
2017-06-12  8:30                             ` Andrew Cooper
2017-05-26 17:04 ` Dario Faggioli
2017-05-26 17:25   ` Juergen Gross
