* performance with guests running 2.4 kernels (specifically RHEL3)
@ 2008-04-16  0:15 David S. Ahern
  2008-04-16  8:46 ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-16  0:15 UTC (permalink / raw)
  To: kvm-devel


I have been looking at RHEL3 based guests lately, and to say the least the
performance is horrible. Rather than write a long tome on what I've done and
observed, I'd like to find out if anyone has some insights or known problem
areas running 2.4 guests. The short of it is that % system time spikes from time
to time (e.g., on exec of a new process such as running /bin/true).

I do not see the problem running RHEL3 on ESX, and an equivalent VM running
RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
that is not handled efficiently by kvm.

Can someone shed some light on it?

thanks,

david

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-16  0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
@ 2008-04-16  8:46 ` Avi Kivity
  2008-04-17 21:12   ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-16  8:46 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I have been looking at RHEL3 based guests lately, and to say the least the
> performance is horrible. Rather than write a long tome on what I've done and
> observed, I'd like to find out if anyone has some insights or known problem
> areas running 2.4 guests. The short of it is that % system time spikes from time
> to time (e.g., on exec of a new process such as running /bin/true).
>
> I do not see the problem running RHEL3 on ESX, and an equivalent VM running
> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
> that is not handled efficiently by kvm.
>
> Can someone shed some light on it?
>   

It's not something that I test regularly.  If you're running a 32-bit 
kernel, I'd suspect kmap(), or perhaps false positives from the fork 
detector.

kvmtrace will probably give enough info to tell exactly what's going on; 
'kvm_stat -1' while the badness is happening may also help.

-- 
error compiling committee.c: too many arguments to function




* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-16  8:46 ` Avi Kivity
@ 2008-04-17 21:12   ` David S. Ahern
  2008-04-18  7:57     ` Avi Kivity
  2008-04-23  8:03     ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-17 21:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

kvm_stat -1 is practically impossible to time correctly to get a good snippet.

kvmtrace is a fascinating tool. I captured trace data that encompassed one
intense period where the VM appeared to freeze (no terminal response for a few
seconds).

After converting to text I examined an arbitrary section in time (how do you
correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
vcpu1 hammered with page faults. (I put the representative data below; I can
send the binary or text files if you really want to see them.) All told, over
about a 10-12 second time period the trace text files contain 8426221 lines and
2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
really high).

david

---------------------------------

vcpu0 data:

0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+    1712)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400096784 (+   76248)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+    1792)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400114528 (+   15952)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+    1800)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400137216 (+   20888)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+    1624)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400209344 (+   70504)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+    1712)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400226312 (+   15256)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+    1728)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400248688 (+   20648)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]




vcpu1 data:

9968400002032 (+    3808)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c016127f ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400005448 (+    3416)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400009832 (+    4384)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c016104a ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x0000000b, virt = 0x00000000 fffb6f88 ]
9968400071584 (+   61752)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400075608 (+    4024)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400083528 (+    7920)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400087288 (+    3760)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400097312 (+   10024)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400103064 (+    5752)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400116624 (+   13560)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400120424 (+    3800)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400123856 (+    3432)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400128208 (+    4352)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000009, virt = 0x00000000 fffb6d28 ]
9968400183848 (+   55640)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400188232 (+    4384)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160e4d ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400196160 (+    7928)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400199928 (+    3768)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160e54 ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400209864 (+    9936)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400214984 (+    5120)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400228232 (+   13248)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400232000 (+    3768)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400235424 (+    3424)  VMENTRY       vcpu = 0x00000000  pid = 0x000011ea
9968400239816 (+    4392)  VMEXIT        vcpu = 0x00000000  pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
0x00000009, virt = 0x00000000 fffb6d30 ]




Avi Kivity wrote:
> David S. Ahern wrote:
>> I have been looking at RHEL3 based guests lately, and to say the least
>> the
>> performance is horrible. Rather than write a long tome on what I've
>> done and
>> observed, I'd like to find out if anyone has some insights or known
>> problem
>> areas running 2.4 guests. The short of it is that % system time spikes
>> from time
>> to time (e.g., on exec of a new process such as running /bin/true).
>>
>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>> running
>> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something
>> in a way
>> that is not handled efficiently by kvm.
>>
>> Can someone shed some light on it?
>>   
> 
> It's not something that I test regularly.  If you're running a 32-bit
> kernel, I'd suspect kmap(), or perhaps false positives from the fork
> detector.
> 
> kvmtrace will probably give enough info to tell exactly what's going on;
> 'kvm_stat -1' while the badness is happening may also help.
> 



* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-17 21:12   ` David S. Ahern
@ 2008-04-18  7:57     ` Avi Kivity
  2008-04-21  4:31       ` David S. Ahern
  2008-04-23  8:03     ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-18  7:57 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
> kvmtrace is a fascinating tool. I captured trace data that encompassed one
> intense period where the VM appeared to freeze (no terminal response for a few
> seconds).
>
> After converting to text I examined an arbitrary section in time (how do you
> correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
> vcpu1 hammered with page faults. (I put the representative data below; I can
> send the binary or text files if you really want to see them.) All told, over
> about a 10-12 second time period the trace text files contain 8426221 lines and
> 2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
> really high).
>
> david
>   

>
> vcpu1 data:
>
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000009, virt = 0x00000000 fffb6d28 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+       0)  PAGE_FAULT    vcpu = 0x00000000  pid = 0x000011ea [ errorcode =
> 0x00000009, virt = 0x00000000 fffb6d30 ]
>
>
>   

The pattern here is c0009db4, c0009db0, fffb6xxx, c0009db0: setting a
pte at c0009db0, accessing the page mapped by the pte, then unmapping the
pte.  Note that c0009db0 (bits 3:11) == 0x1b6 == fffb6xxx (bits 12:20).
That's a kmap_atomic() + access + kunmap_atomic() sequence.

The expensive accesses (~50K cycles) seem to be the ones at fffb6xxx.
Now these shouldn't show up at all -- the kvm_mmu_pte_write() ought to
have set up the ptes correctly.

Can you add a trace at mmu_guess_page_from_pte_write(), right before "if 
(is_present_pte(gpte))"?  I'm interested in gpa and gpte.  Also a trace 
at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase 
the 3 to 4 in the line right above that, maybe the fork detector is 
misfiring).


---------------------------------

vcpu0 data:

0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+    1712)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400096784 (+   76248)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+    1792)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400114528 (+   15952)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+    1800)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400137216 (+   20888)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+    1624)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400209344 (+   70504)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+    1712)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400226312 (+   15256)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+       0)  INTR          vcpu = 0x00000001  pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+    1728)  VMENTRY       vcpu = 0x00000001  pid = 0x000011ea
9968400248688 (+   20648)  VMEXIT        vcpu = 0x00000001  pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]




Those are probably IPIs due to the kmaps above.

>
> Avi Kivity wrote:
>   
>> David S. Ahern wrote:
>>     
>>> I have been looking at RHEL3 based guests lately, and to say the least
>>> the
>>> performance is horrible. Rather than write a long tome on what I've
>>> done and
>>> observed, I'd like to find out if anyone has some insights or known
>>> problem
>>> areas running 2.4 guests. The short of it is that % system time spikes
>>> from time
>>> to time (e.g., on exec of a new process such as running /bin/true).
>>>
>>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>>> running
>>> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something
>>> in a way
>>> that is not handled efficiently by kvm.
>>>
>>> Can someone shed some light on it?
>>>   
>>>       
>> It's not something that I test regularly.  If you're running a 32-bit
>> kernel, I'd suspect kmap(), or perhaps false positives from the fork
>> detector.
>>
>> kvmtrace will probably give enough info to tell exactly what's going on;
>> 'kvm_stat -1' while the badness is happening may also help.
>>
>>     


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.




* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-18  7:57     ` Avi Kivity
@ 2008-04-21  4:31       ` David S. Ahern
  2008-04-21  9:19         ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-21  4:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

I added the traces and captured data over another apparent lockup of the guest.
This seems to be representative of the sequence (pid/vcpu removed).

(+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3632)  VMENTRY
(+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
(+   54928)  VMENTRY
(+4568)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
(+8432)  VMENTRY
(+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+   13832)  VMENTRY


(+5768)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3712)  VMENTRY
(+4576)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
(+   0)  PTE_WRITE      [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
(+   65216)  VMENTRY
(+4232)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
(+8640)  VMENTRY
(+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+   14160)  VMENTRY

I can forward a more complete time snippet if you'd like. The vcpu0 and
corresponding vcpu1 files have 85000 total lines, and compressed they total ~500k.

I did not see the FLOODED trace come out during this sample though I did bump
the count from 3 to 4 as you suggested.


Correlating rip addresses to the 2.4 kernel:

c0160d00-c0161290 = page_referenced

It looks like the event is kscand running through the pages. I suspected this
some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
appeared to lower the peak of the spikes, but maybe I imagined it. I believe
lowering that value makes kscand wake up more often but do less work (page
scanning) each time it is awakened.

david


Avi Kivity wrote:
> Can you add a trace at mmu_guess_page_from_pte_write(), right before "if 
> (is_present_pte(gpte))"?  I'm interested in gpa and gpte.  Also a trace 
> at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase 
> the 3 to 4 in the line right above that, maybe the fork detector is 
> misfiring).




* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  4:31       ` David S. Ahern
@ 2008-04-21  9:19         ` Avi Kivity
  2008-04-21 17:07           ` David S. Ahern
  2008-04-22 20:23           ` David S. Ahern
  0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-21  9:19 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I added the traces and captured data over another apparent lockup of the guest.
> This seems to be representative of the sequence (pid/vcpu removed).
>
> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3632)  VMENTRY
> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
> (+   54928)  VMENTRY
>   

Can you oprofile the host to see where the 54K cycles are spent?

> (+4568)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
> (+8432)  VMENTRY
> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+   13832)  VMENTRY
>
>
> (+5768)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3712)  VMENTRY
> (+4576)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
> (+   0)  PTE_WRITE      [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>   

This indeed has the accessed bit clear.

> (+   65216)  VMENTRY
> (+4232)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>   

This has the accessed bit set and the user bit clear, and the pte 
pointing at the previous pte_write gpa.  Looks like a kmap_atomic().

> (+8640)  VMENTRY
> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+   14160)  VMENTRY
>
> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>
> I did not see the FLOODED trace come out during this sample though I did bump
> the count from 3 to 4 as you suggested.
>
>
>   

Bumping the count was supposed to remove the flooding...

> Correlating rip addresses to the 2.4 kernel:
>
> c0160d00-c0161290 = page_referenced
>
> It looks like the event is kscand running through the pages. I suspected this
> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
> lowering that value makes kscand wake up more often but do less work (page
> scanning) each time it is awakened.
>
>   

What does 'top' in the guest show (perhaps sorted by total cpu time 
rather than instantaneous usage)?

What host kernel are you running?  How many host cpus?

-- 
error compiling committee.c: too many arguments to function




* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  9:19         ` Avi Kivity
@ 2008-04-21 17:07           ` David S. Ahern
  2008-04-22 20:23           ` David S. Ahern
  1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-21 17:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

host:
    2.6.25-rc8, x86_64, kvm-66
    1 dual-core Xeon(R) CPU 3050 @ 2.13GHz
    6 GB RAM
    (This behavior also occurs on a larger server with 2 dual-core Xeon(R) CPU
5140 @ 2.33GHz, 4 GB RAM. Same kernel and kvm versions.)

guest:
    RHEL3 U8 (2.4.21-47.ELsmp), 2 vcpus, 2 GB RAM


As usual, I waited for a guest "event" -- high system time in the guest which
appears to lock it up. Following the event, kscand was the top CPU user
(cumulative time) in the guest.

During the event, 2 qemu threads peg the host CPU at 100%. Top samples from
oprofile (started after the freeze began and stopped when the guest became
responsive again):

samples  %        image name     app name       symbol name
171716   35.1350  kvm-intel.ko   kvm_intel     vmx_vcpu_run
45836     9.3786  vmlinux        vmlinux       copy_user_generic_string
39417     8.0652  kvm.ko         kvm           kvm_read_guest_atomic
23604     4.8296  vmlinux        vmlinux       add_preempt_count
22878     4.6811  vmlinux        vmlinux       __smp_call_function_mask
16143     3.3030  kvm.ko         kvm           gfn_to_hva
14648     2.9971  vmlinux        vmlinux       sub_preempt_count
14589     2.9851  kvm.ko         kvm           __gfn_to_memslot
11666     2.3870  kvm.ko         kvm           unalias_gfn
10834     2.2168  kvm.ko         kvm           kvm_mmu_zap_page
10532     2.1550  kvm.ko         kvm           paging64_prefetch_page
6285      1.2860  kvm-intel.ko   kvm_intel     handle_exception
6066      1.2412  kvm.ko         kvm           kvm_arch_vcpu_ioctl_run
4741      0.9701  kvm.ko         kvm           kvm_add_trace
3801      0.7777  vmlinux        vmlinux       __copy_from_user_inatomic
3592      0.7350  vmlinux        vmlinux       follow_page
3326      0.6805  kvm.ko         kvm           mmu_memory_cache_alloc
3317      0.6787  kvm-intel.ko   kvm_intel     kvm_handle_exit
2971      0.6079  kvm.ko         kvm           paging64_page_fault
2777      0.5682  kvm.ko         kvm           paging64_walk_addr
2294      0.4694  kvm.ko         kvm           kvm_mmu_pte_write
2278      0.4661  kvm.ko         kvm           kvm_flush_remote_tlbs
2266      0.4636  kvm-intel.ko   kvm_intel     vmcs_writel
2086      0.4268  kvm.ko         kvm           mmu_set_spte
2041      0.4176  kvm.ko         kvm           kvm_read_guest
1615      0.3304  vmlinux        vmlinux       free_hot_cold_page

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3632)  VMENTRY
>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61c8 ]
>> (+   54928)  VMENTRY
>>   
> 
> Can you oprofile the host to see where the 54K cycles are spent?
> 
>> (+4568)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 41c5d363 ]
>> (+8432)  VMENTRY
>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+   13832)  VMENTRY
>>
>>
>> (+5768)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3712)  VMENTRY
>> (+4576)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61d0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>> 3d55d047 ]
>>   
> 
> This indeed has the accessed bit clear.
> 
>> (+   65216)  VMENTRY
>> (+4232)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 3d598363 ]
>>   
> 
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa.  Looks like a kmap_atomic().
> 
>> (+8640)  VMENTRY
>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+   14160)  VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 +
>> corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I
>> did bump
>> the count from 3 to 4 as you suggested.
>>
>>
>>   
> 
> Bumping the count was supposed to remove the flooding...
> 
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I
>> suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl
>> variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>> believe
>> lowering that value makes kscand wake up more often but do less work
>> (page
>> scanning) each time it is awakened.
>>
>>   
> 
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
> 
> What host kernel are you running?  How many host cpus?
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  9:19         ` Avi Kivity
  2008-04-21 17:07           ` David S. Ahern
@ 2008-04-22 20:23           ` David S. Ahern
  2008-04-23  8:04             ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-22 20:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

I added tracers to kvm_mmu_page_fault() that collect TSC cycles at the following points:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault() (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5-seconds during one of the usual events (again this
time it was due to kscand in the guest). I ran the formatted trace data through
an awk script to summarize:

    TSC cycles      tracer2   tracer3   tracer4
      0 -  10,000:   295067    213251    115873
 10,001 -  25,000:     7682      1004     98336
 25,001 -  50,000:      201        15        36
 50,001 - 100,000:   100655         0        10
        > 100,000:      117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to SCHED_RR 1 so that
few host processes could preempt them.

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3632)  VMENTRY
>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61c8 ]
>> (+   54928)  VMENTRY
>>   
> 
> Can you oprofile the host to see where the 54K cycles are spent?
> 
>> (+4568)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 41c5d363 ]
>> (+8432)  VMENTRY
>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+   13832)  VMENTRY
>>
>>
>> (+5768)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3712)  VMENTRY
>> (+4576)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61d0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>> 3d55d047 ]
>>   
> 
> This indeed has the accessed bit clear.
> 
>> (+   65216)  VMENTRY
>> (+4232)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 3d598363 ]
>>   
> 
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa.  Looks like a kmap_atomic().
> 
>> (+8640)  VMENTRY
>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+   14160)  VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 +
>> corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I
>> did bump
>> the count from 3 to 4 as you suggested.
>>
>>
>>   
> 
> Bumping the count was supposed to remove the flooding...
> 
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I
>> suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl
>> variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>> believe
>> lowering that value makes kscand wake up more often but do less work
>> (page
>> scanning) each time it is awakened.
>>
>>   
> 
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
> 
> What host kernel are you running?  How many host cpus?
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-17 21:12   ` David S. Ahern
  2008-04-18  7:57     ` Avi Kivity
@ 2008-04-23  8:03     ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-23  8:03 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
>   

I've added a --log option to get vmstat-like output.  I've also added 
--fields to select which fields are of interest, to avoid the need for 
280-column displays.  That's now pushed to kvm-userspace.git.

Example:

./kvm_stat -f 'mmu.*|pf.*|remote.*' -l

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-22 20:23           ` David S. Ahern
@ 2008-04-23  8:04             ` Avi Kivity
  2008-04-23 15:23               ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-23  8:04 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:
>
> 1. before vcpu->arch.mmu.page_fault()
> 2. after vcpu->arch.mmu.page_fault()
> 3. after mmu_topup_memory_caches()
> 4. after emulate_instruction()
>
> So the delta in the trace reports show:
> - cycles required for arch.mmu.page_fault (tracer 2)
> - cycles required for mmu_topup_memory_caches(tracer 3)
> - cycles required for emulate_instruction() (tracer 4)
>
> I captured trace data for ~5-seconds during one of the usual events (again this
> time it was due to kscand in the guest). I ran the formatted trace data through
> an awk script to summarize:
>
>     TSC cycles      tracer2   tracer3   tracer4
>       0 -  10,000:   295067    213251    115873
>  10,001 -  25,000:     7682      1004     98336
>  25,001 -  50,000:      201        15        36
>  50,001 - 100,000:   100655         0        10
>         > 100,000:      117         0        15
>
> This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
> 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
> took longer than 50,000 cycles. The page_fault function getting run is
> paging64_page_fault.
>
>   

This does look like the fork detector.  Once in every four faults, it 
triggers and the fault becomes slow.  100K floods == 100K page tables == 
200GB of virtual memory, which seems excessive.

Is this running a forked load like apache, with many processes?  How 
much memory is on the guest, and is there any memory pressure?

> mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
> most of them relatively quickly.
> Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
> host processes could interrupt it.
>
> david
>
>
> Avi Kivity wrote:
>   
>> David S. Ahern wrote:
>>     
>>> I added the traces and captured data over another apparent lockup of
>>> the guest.
>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>
>>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3632)  VMENTRY
>>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61c8 ]
>>> (+   54928)  VMENTRY
>>>   
>>>       
>> Can you oprofile the host to see where the 54K cycles are spent?
>>
>>     
>>> (+4568)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610e7 ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>>> 41c5d363 ]
>>> (+8432)  VMENTRY
>>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610ee ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db0 ]
>>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>>> 00000000 ]
>>> (+   13832)  VMENTRY
>>>
>>>
>>> (+5768)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3712)  VMENTRY
>>> (+4576)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61d0 ]
>>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>>> 3d55d047 ]
>>>   
>>>       
>> This indeed has the accessed bit clear.
>>
>>     
>>> (+   65216)  VMENTRY
>>> (+4232)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610e7 ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>>> 3d598363 ]
>>>   
>>>       
>> This has the accessed bit set and the user bit clear, and the pte
>> pointing at the previous pte_write gpa.  Looks like a kmap_atomic().
>>
>>     
>>> (+8640)  VMENTRY
>>> (+3936)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610ee ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db0 ]
>>> (+   0)  PTE_WRITE      [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>>> 00000000 ]
>>> (+   14160)  VMENTRY
>>>
>>> I can forward a more complete time snippet if you'd like. vcpu0 +
>>> corresponding
>>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>>
>>> I did not see the FLOODED trace come out during this sample though I
>>> did bump
>>> the count from 3 to 4 as you suggested.
>>>
>>>
>>>   
>>>       
>> Bumping the count was supposed to remove the flooding...
>>
>>     
>>> Correlating rip addresses to the 2.4 kernel:
>>>
>>> c0160d00-c0161290 = page_referenced
>>>
>>> It looks like the event is kscand running through the pages. I
>>> suspected this
>>> some time ago, and tried tweaking the kscand_work_percent sysctl
>>> variable. It
>>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>>> believe
>>> lowering that value makes kscand wake up more often but do less work
>>> (page
>>> scanning) each time it is awakened.
>>>
>>>   
>>>       
>> What does 'top' in the guest show (perhaps sorted by total cpu time
>> rather than instantaneous usage)?
>>
>> What host kernel are you running?  How many host cpus?
>>
>>     


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23  8:04             ` Avi Kivity
@ 2008-04-23 15:23               ` David S. Ahern
  2008-04-23 15:53                 ` Avi Kivity
  2008-04-25 17:33                 ` David S. Ahern
  0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-23 15:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

>> Avi Kivity wrote:
>>
>>> David S. Ahern wrote:
>>>
>>>> I added the traces and captured data over another apparent lockup of
>>>> the guest.
>>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>>
>>>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016127c ]
>>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>>> c0009db4 ]
>>>> (+3632)  VMENTRY
>>>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016104a ]
>>>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>>>> fffb61c8 ]
>>>> (+   54928)  VMENTRY
>>>>
>>> Can you oprofile the host to see where the 54K cycles are spent?
>>>
>>>

I've continued drilling down with the tracers to answer that question. Runs with
tracers in paging64_page_fault() showed that the overhead is in the fetch()
function. On my last run the tracers were in paging64_fetch() as follows:

1. after is_present_pte() check
2. before kvm_mmu_get_page()
3. after kvm_mmu_get_page()
4. after if (!metaphysical) {}

The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta
between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run.
Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new
tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and
access variables; tracer5 dumps the value of shadow_ent.

A representative trace sample is:

(+    4576)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+       0)  PAGE_FAULT    [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+    2664)  PAGE_FAULT1   [ write_count = 0 ]
(+     472)  PAGE_FAULT2   [ level = 2 metaphysical = 0 access 0x00000007 ]
(+   50416)  PAGE_FAULT3
(+     472)  PAGE_FAULT4
(+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x80000000 9276d043 ]
(+    1528)  VMENTRY
(+    4992)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+    2296)  PAGE_FAULT1   [ write_count = 0 ]
(+     816)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809041 ]
(+       0)  PTE_WRITE     [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
(+    6424)  VMENTRY
(+    3864)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+    2496)  PAGE_FAULT1   [ write_count = 1 ]
(+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809041 ]
(+       0)  PTE_WRITE     [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+   10248)  VMENTRY
(+    4744)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+    2408)  PAGE_FAULT1   [ write_count = 2 ]
(+     760)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809043 ]
(+    1240)  VMENTRY
(+    4624)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+       0)  PAGE_FAULT    [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+    2512)  PAGE_FAULT1   [ write_count = 0 ]
(+     496)  PAGE_FAULT2   [ level = 2 metaphysical = 0 access 0x00000007 ]
(+   48664)  PAGE_FAULT3
(+     472)  PAGE_FAULT4
(+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x80000000 9272d043 ]
(+    1576)  VMENTRY

So basically every 4th trip through the fetch function runs
kvm_mmu_get_page(). A summary of the entire trace file shows that
kvm_mmu_get_page() rarely executes in less than 50,000 cycles. Also,
vcpu->arch.last_pt_write_count is always 0 when the high cycle counts are hit.


More tidbits:
- The hugepage option seems to have no effect -- the system-time spikes and
overhead occur with and without the hugepage option (the data above is with it).

- As the guest runs for hours, the intensity of the spikes drops, though they
still occur regularly, and kscand continues to be the primary suspect. qemu's
RSS tends toward the guest's memory allotment of 2GB. Internally, guest memory
usage runs at ~1GB page cache, 57M buffers, 24M swap, ~800MB for processes.

- I have looked at process creation and do not see a strong correlation between
system time spikes and the number of new processes. So far the only
correlations seem to be kscand and the amount of memory used; i.e., stock RHEL3
with few processes shows tiny spikes, whereas in my tests with 90+ processes
using about 800M, plus a continually updating page cache (moderate I/O levels),
the spikes are strong and last for seconds.

- Time runs really fast in the guest, gaining several minutes in 24 hours.

I'll download your kvm_stat update and give it a try. When I started this
investigation I was using Christian's kvmstat script, which dumped stats to a
file. Plots of that data did not show a strong correlation with guest system time.

david


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23 15:23               ` David S. Ahern
@ 2008-04-23 15:53                 ` Avi Kivity
  2008-04-23 16:39                   ` David S. Ahern
  2008-04-25 17:33                 ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-23 15:53 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I've continued drilling down with the tracers to answer that question. I have
> done runs with tracers in paging64_page_fault and it showed the overhead is with
> the fetch() function. On my last run the tracers are in paging64_fetch() as follows:
>
> 1. after is_present_pte() check
> 2. before kvm_mmu_get_page()
> 3. after kvm_mmu_get_page()
> 4. after if (!metaphysical) {}
>
> The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta
> between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run.
> Tracer1 dumps  vcpu->arch.last_pt_write_count (a carryover from when the new
> tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and
> access variables; tracer5 dumps value in shadow_ent.
>
> A representative trace sample is:
>
> (+    4576)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT    [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+    2664)  PAGE_FAULT1   [ write_count = 0 ]
> (+     472)  PAGE_FAULT2   [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   50416)  PAGE_FAULT3
> (+     472)  PAGE_FAULT4
> (+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x80000000 9276d043 ]
> (+    1528)  VMENTRY
> (+    4992)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2296)  PAGE_FAULT1   [ write_count = 0 ]
> (+     816)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809041 ]
> (+       0)  PTE_WRITE     [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
> (+    6424)  VMENTRY
> (+    3864)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+    2496)  PAGE_FAULT1   [ write_count = 1 ]
> (+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809041 ]
> (+       0)  PTE_WRITE     [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+   10248)  VMENTRY
> (+    4744)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+       0)  PAGE_FAULT    [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2408)  PAGE_FAULT1   [ write_count = 2 ]
> (+     760)  PAGE_FAULT5   [ shadow_ent_val = 0x00000000 8a809043 ]
> (+    1240)  VMENTRY
> (+    4624)  VMEXIT        [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT    [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+    2512)  PAGE_FAULT1   [ write_count = 0 ]
> (+     496)  PAGE_FAULT2   [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   48664)  PAGE_FAULT3
> (+     472)  PAGE_FAULT4
> (+     856)  PAGE_FAULT5   [ shadow_ent_val = 0x80000000 9272d043 ]
> (+    1576)  VMENTRY
>
> So basically every 4th trip through the fetch function it runs
> kvm_mmu_get_page(). A summary of the entire trace file shows this function
> rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count
> is always 0 when the high cycles are hit.
>
>   

Ah!  The flood detector is not seeing the access through the 
kmap_atomic() pte, because that access has gone through the emulator.  
last_updated_pte_accessed(vcpu) will never return true.

Can you verify that last_updated_pte_accessed(vcpu) indeed always 
returns false?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23 15:53                 ` Avi Kivity
@ 2008-04-23 16:39                   ` David S. Ahern
  2008-04-24 17:25                     ` David S. Ahern
  2008-04-26  6:20                     ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-23 16:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel



Avi Kivity wrote:
> 
> Ah!  The flood detector is not seeing the access through the
> kmap_atomic() pte, because that access has gone through the emulator. 
> last_updated_pte_accessed(vcpu) will never return true.
> 
> Can you verify that last_updated_pte_accessed(vcpu) indeed always
> returns false?
> 

It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
the return code of last_updated_pte_accessed(vcpu), i.e.:
	pte_access = last_updated_pte_accessed(vcpu);
        KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);

A sample:

(+    4488)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+    2480)  PAGE_FAULT1  [ write_count = 0 ]
(+     424)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
(+   51672)  PAGE_FAULT3
(+     472)  PAGE_FAULT4
(+     704)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2dfb5043 ]
(+    1496)  VMENTRY
(+    4568)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+    2352)  PAGE_FAULT1  [ write_count = 0 ]
(+     728)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
(+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
(+       0)  PTE_ACCESS   [ pte_access = 1 ]
(+    6864)  VMENTRY
(+    3896)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+    2376)  PAGE_FAULT1  [ write_count = 1 ]
(+     720)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
(+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+       0)  PTE_ACCESS   [ pte_access = 0 ]
(+   12344)  VMENTRY
(+    4688)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+    2416)  PAGE_FAULT1  [ write_count = 2 ]
(+     792)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409043 ]
(+    1128)  VMENTRY
(+    4512)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+    2448)  PAGE_FAULT1  [ write_count = 0 ]
(+     448)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
(+   51520)  PAGE_FAULT3
(+     432)  PAGE_FAULT4
(+     696)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2df5a043 ]
(+    1480)  VMENTRY


david


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23 16:39                   ` David S. Ahern
@ 2008-04-24 17:25                     ` David S. Ahern
  2008-04-26  6:43                       ` Avi Kivity
  2008-04-26  6:20                     ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-24 17:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel


What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
current instruction pointer for the guest?

I take it the virt in the PAGE_FAULT trace output is the virtual address the
guest was referencing when the page fault occurred. What I don't understand (one
of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
ideas?

Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
trace data). What does the 4th bit in 0xb mean? bit 0 set means
PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?

david


David S. Ahern wrote:
> 
> Avi Kivity wrote:
>> Ah!  The flood detector is not seeing the access through the
>> kmap_atomic() pte, because that access has gone through the emulator. 
>> last_updated_pte_accessed(vcpu) will never return true.
>>
>> Can you verify that last_updated_pte_accessed(vcpu) indeed always
>> returns false?
>>
> 
> It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
> the rc of last_updated_pte_accessed(vcpu). ie.,
> 	pte_access = last_updated_pte_accessed(vcpu);
>         KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
> 
> A sample:
> 
> (+    4488)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+    2480)  PAGE_FAULT1  [ write_count = 0 ]
> (+     424)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   51672)  PAGE_FAULT3
> (+     472)  PAGE_FAULT4
> (+     704)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2dfb5043 ]
> (+    1496)  VMENTRY
> (+    4568)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2352)  PAGE_FAULT1  [ write_count = 0 ]
> (+     728)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
> (+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
> (+       0)  PTE_ACCESS   [ pte_access = 1 ]
> (+    6864)  VMENTRY
> (+    3896)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+    2376)  PAGE_FAULT1  [ write_count = 1 ]
> (+     720)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
> (+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+       0)  PTE_ACCESS   [ pte_access = 0 ]
> (+   12344)  VMENTRY
> (+    4688)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2416)  PAGE_FAULT1  [ write_count = 2 ]
> (+     792)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409043 ]
> (+    1128)  VMENTRY
> (+    4512)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+    2448)  PAGE_FAULT1  [ write_count = 0 ]
> (+     448)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   51520)  PAGE_FAULT3
> (+     432)  PAGE_FAULT4
> (+     696)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2df5a043 ]
> (+    1480)  VMENTRY
> 
> 
> david
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23 15:23               ` David S. Ahern
  2008-04-23 15:53                 ` Avi Kivity
@ 2008-04-25 17:33                 ` David S. Ahern
  2008-04-26  6:45                   ` Avi Kivity
  2008-04-28 18:15                   ` Marcelo Tosatti
  1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-25 17:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

David S. Ahern wrote:
> Avi Kivity wrote:
>
>> David S. Ahern wrote:
>>
>>> I added the traces and captured data over another apparent lockup of
>>> the guest.
>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>
>>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3632)  VMENTRY
>>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61c8 ]
>>> (+   54928)  VMENTRY
>>>
>> Can you oprofile the host to see where the 54K cycles are spent?
>>

Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():

        for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
                gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
                pte_gpa += (i+offset) * sizeof(pt_element_t);

                r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
                                          sizeof(pt_element_t));
                if (r || is_present_pte(pt))
                        sp->spt[i] = shadow_trap_nonpresent_pte;
                else
                        sp->spt[i] = shadow_notrap_nonpresent_pte;
        }

This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
loop.

This function gets run >20,000/sec during some of the kscand loops.

david

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-23 16:39                   ` David S. Ahern
  2008-04-24 17:25                     ` David S. Ahern
@ 2008-04-26  6:20                     ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26  6:20 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> Avi Kivity wrote:
>   
>> Ah!  The flood detector is not seeing the access through the
>> kmap_atomic() pte, because that access has gone through the emulator. 
>> last_updated_pte_accessed(vcpu) will never return true.
>>
>> Can you verify that last_updated_pte_accessed(vcpu) indeed always
>> returns false?
>>
>>     
>
> It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
> the rc of last_updated_pte_accessed(vcpu), i.e.,
> 	pte_access = last_updated_pte_accessed(vcpu);
>         KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
>
> A sample:
>
> (+    4488)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+    2480)  PAGE_FAULT1  [ write_count = 0 ]
> (+     424)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   51672)  PAGE_FAULT3
> (+     472)  PAGE_FAULT4
> (+     704)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2dfb5043 ]
> (+    1496)  VMENTRY
> (+    4568)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2352)  PAGE_FAULT1  [ write_count = 0 ]
> (+     728)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
> (+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
> (+       0)  PTE_ACCESS   [ pte_access = 1 ]
> (+    6864)  VMENTRY
> (+    3896)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+    2376)  PAGE_FAULT1  [ write_count = 1 ]
> (+     720)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409041 ]
> (+       0)  PTE_WRITE    [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+       0)  PTE_ACCESS   [ pte_access = 0 ]
> (+   12344)  VMENTRY
> (+    4688)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+    2416)  PAGE_FAULT1  [ write_count = 2 ]
> (+     792)  PAGE_FAULT5  [ shadow_ent = 0x00000001 91409043 ]
> (+    1128)  VMENTRY
> (+    4512)  VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+       0)  PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+    2448)  PAGE_FAULT1  [ write_count = 0 ]
> (+     448)  PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+   51520)  PAGE_FAULT3
> (+     432)  PAGE_FAULT4
> (+     696)  PAGE_FAULT5  [ shadow_ent = 0x80000001 2df5a043 ]
> (+    1480)  VMENTRY
>
>   

Strange... there should be at least two pte_access = 0 traces in there 
before flooding can occur, according to my reading of the code.  The 
counter needs to go up to 3 somehow.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-24 17:25                     ` David S. Ahern
@ 2008-04-26  6:43                       ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26  6:43 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
> current instruction pointer for the guest?
>
>   

Yes.

> I take it the virt in the PAGE_FAULT trace output is the virtual address the
> guest was referencing when the page fault occurred. What I don't understand (one
> of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
> ideas?
>
>   

I'm pretty sure it is the kmap_atomic() pte.  The guest wants to update 
a pte (call it pte1), which is in HIGHMEM, so it doesn't have a 
permanent mapping for it.  It calls kmap_atomic() which sets up another 
pte (pte2, two writes), and then accesses pte1 through pte2.

> Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
> trace data). What does the 4th bit in 0xb mean? bit 0 set means
> PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?
>   

Bit 3 is the reserved bit, which means the shadow pte has an illegal bit 
combination.  kvm sets up vmx to forward non-present page faults (bit 0 
clear) directly to the guest, so it needs some other pattern to get a 
trapping fault.

IOW, there are two types of non-present shadow ptes in kvm: trapping 
ones (where we don't know what the guest pte looks like) and nontrapping 
ones (where we know the guest pte is not present, so we forward the 
fault directly to the guest).  The first type is encoded with the 
reserved bit and present bit set, the second with both of them clear.

You can disable this trickery using the bypass_guest_pf module 
parameter.  It would be useful to try it; we'll see the forwarded 
faults as well.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-25 17:33                 ` David S. Ahern
@ 2008-04-26  6:45                   ` Avi Kivity
  2008-04-28 18:15                   ` Marcelo Tosatti
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26  6:45 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> David S. Ahern wrote:
>   
>> Avi Kivity wrote:
>>
>>     
>>> David S. Ahern wrote:
>>>
>>>       
>>>> I added the traces and captured data over another apparent lockup of
>>>> the guest.
>>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>>
>>>> (+4776)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016127c ]
>>>> (+   0)  PAGE_FAULT     [ errorcode = 0x00000003, virt = 0x00000000
>>>> c0009db4 ]
>>>> (+3632)  VMENTRY
>>>> (+4552)  VMEXIT         [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016104a ]
>>>> (+   0)  PAGE_FAULT     [ errorcode = 0x0000000b, virt = 0x00000000
>>>> fffb61c8 ]
>>>> (+   54928)  VMENTRY
>>>>
>>>>         
>>> Can you oprofile the host to see where the 54K cycles are spent?
>>>
>>>       
>
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>
>         for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>                 gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>                 pte_gpa += (i+offset) * sizeof(pt_element_t);
>
>                 r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                           sizeof(pt_element_t));
>                 if (r || is_present_pte(pt))
>                         sp->spt[i] = shadow_trap_nonpresent_pte;
>                 else
>                         sp->spt[i] = shadow_notrap_nonpresent_pte;
>         }
>
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
> loop.
>
> This function gets run >20,000/sec during some of the kscand loops.
>
>   

We really ought to optimize it.  That's second-order, however.  The real 
fix is making sure it isn't called so often.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-25 17:33                 ` David S. Ahern
  2008-04-26  6:45                   ` Avi Kivity
@ 2008-04-28 18:15                   ` Marcelo Tosatti
  2008-04-28 23:45                     ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Marcelo Tosatti @ 2008-04-28 18:15 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Avi Kivity

On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
> 
>         for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>                 gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>                 pte_gpa += (i+offset) * sizeof(pt_element_t);
> 
>                 r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                           sizeof(pt_element_t));
>                 if (r || is_present_pte(pt))
>                         sp->spt[i] = shadow_trap_nonpresent_pte;
>                 else
>                         sp->spt[i] = shadow_notrap_nonpresent_pte;
>         }
> 
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
> loop.
> 
> This function gets run >20,000/sec during some of the kscand loops.

Hi David,

Do you see the mmu_recycled counter increase?

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-28 18:15                   ` Marcelo Tosatti
@ 2008-04-28 23:45                     ` David S. Ahern
  2008-04-30  4:18                       ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-28 23:45 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Avi Kivity

Hi Marcelo:

mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime.

Here is a kvm_stat sample where guest time was very high and qemu had 2
processors at 100% on the host. I removed counters where both columns have 0
value for brevity.

 exits                 45937979  758051
 fpu_reload             1416831      87
 halt_exits              112911       0
 halt_wakeup              31771       0
 host_state_reload      2068602     263
 insn_emulation        21601480  365493
 io_exits               1827374    2705
 irq_exits              8934818  285196
 mmio_exits              421674     147
 mmu_cache_miss         4817689   93680
 mmu_flooded            4815273   93680
 mmu_pde_zapped           51344       0
 mmu_prefetch           4817625   93680
 mmu_pte_updated       14803298  270104
 mmu_pte_write         19859863  363785
 mmu_shadow_zapped      4832106   93679
 pf_fixed              32184355  468398
 pf_guest                264138       0
 remote_tlb_flush      10697762  280522
 tlb_flush             10301338  176424

(NOTE: This is for a *5* second sample interval instead of 1 to allow me to
capture the data).

Here's a sample when the guest is "well-behaved" (system time <10%):
 exits                 51502194   97453
 fpu_reload             1421736     227
 halt_exits              138361    1927
 halt_wakeup              33047     117
 host_state_reload      2110190    3740
 insn_emulation        24367441   47260
 io_exits               1874075    2576
 irq_exits             10224702   13333
 mmio_exits              435154    1726
 mmu_cache_miss         5414097   11258
 mmu_flooded            5411548   11243
 mmu_pde_zapped           52851      44
 mmu_prefetch           5414031   11258
 mmu_pte_updated       16854686   29901
 mmu_pte_write         22526765   42285
 mmu_shadow_zapped      5430025   11313
 pf_fixed              36144578   67666
 pf_guest                282794     430
 remote_tlb_flush      12126268   14619
 tlb_flush             11753162   21460


There is definitely a strong correlation between the mmu counters and high
system times in the guest. I am still trying to find out what in the guest is
stimulating it when running on RHEL3; I do not see this same behavior for an
equivalent setup running on RHEL4.

By the way, I added an mmu_prefetch stat in prefetch_page() to count the number
of times the for() loop is hit with PTTYPE == 64; i.e., the number of times
paging64_prefetch_page() is invoked. (I wanted an explicit counter for this
loop, though the info seems to duplicate other entries.) That counter is listed
above. As I mentioned in a prior post, when kscand kicks in the mmu_prefetch
counter climbs at 20,000+/sec, with each trip through that function
taking 45k+ cycles.

kscand is an instigator shortly after boot; however, kscand is *not* the culprit
once the system has been up for 30-45 minutes. I have started instrumenting the
RHEL3U8 kernel, and for the load I am running kscand does not walk the active
lists very often once the system is up.

So, to dig deeper into what in the guest is stimulating the mmu, I collected
kvmtrace data for about a 2-minute interval, which caught a roughly 30-second
period where guest system time was steady in the 25-30% range. Summarizing the
number of times a RIP appears in a VMEXIT shows the following high runners:

  count      RIP       RHEL3-symbol
  82549   0xc0140e42  follow_page [kernel] c0140d90 offset b2
  42532   0xc0144760  handle_mm_fault [kernel] c01446d0 offset 90
  36826   0xc013da4a  futex_wait [kernel] c013d870 offset 1da
  29987   0xc0145cd0  zap_pte_range [kernel] c0145c10 offset c0
  27451   0xc0144018  do_no_page [kernel] c0143e20 offset 1f8

(halt entry removed from the list since that is the ideal scenario for an exit).

So the RIP correlates to follow_page() for a large percentage of the VMEXITs.

I wrote an awk script to summarize (histogram style) the TSC cycles between
VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, the trace
shows a delta between 50k and 100k cycles from the VMEXIT to the subsequent
VMENTRY 82,271 times (i.e., almost 100% of the occurrences). Similarly, for the
second one, 0xc0144760, the trace shows a 50k-100k cycle delta 42,403 times
(again almost 100% of the occurrences). These deltas seem to correlate with the
prefetch_page function in kvm, though I am not 100% positive on that.

I am now investigating the kernel paths leading to those functions. Any insights
would definitely be appreciated.

david


Marcelo Tosatti wrote:
> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>>
>>         for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>                 gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>                 pte_gpa += (i+offset) * sizeof(pt_element_t);
>>
>>                 r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>                                           sizeof(pt_element_t));
>>                 if (r || is_present_pte(pt))
>>                         sp->spt[i] = shadow_trap_nonpresent_pte;
>>                 else
>>                         sp->spt[i] = shadow_notrap_nonpresent_pte;
>>         }
>>
>> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
>> loop.
>>
>> This function gets run >20,000/sec during some of the kscand loops.
> 
> Hi David,
> 
> Do you see the mmu_recycled counter increase?
> 

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-28 23:45                     ` David S. Ahern
@ 2008-04-30  4:18                       ` David S. Ahern
  2008-04-30  9:55                         ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-30  4:18 UTC (permalink / raw)
  To: Marcelo Tosatti, Avi Kivity; +Cc: kvm-devel

Another tidbit for you guys as I make my way through various permutations:
I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
System time still has some regular hiccups that are higher than xen and esx
(e.g., 1 minute samples out of 5 show system time between 10 and 15%), but
overall guest behavior is good with the hugemem kernel.

One side effect I've noticed is that I cannot restart the RHEL3 guest running
the hugemem kernel in successive attempts. The guest has 2 vcpus and qemu shows
one thread at 100% cpu. If I recall correctly kvm_stat shows a large number of
tlb_flushes (like millions in a 5-second sample). The scenario is:
1. start guest running hugemem kernel,
2. shutdown,
3. restart guest.

During step 3 it hangs, but at random points. Removing kvm/kvm-intel has no effect -
guest still hangs on the restart. Rebooting the host clears the problem.

Alternatively, during the hang on a restart I can kill the guest, and then on
restart choose the normal, 32-bit smp kernel and the guest boots just fine. At
this point I can shutdown the guest and restart with the hugemem kernel and it
boots just fine.

david


David S. Ahern wrote:
> Hi Marcelo:
> 
> mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime.
> 
> Here is a kvm_stat sample where guest time was very high and qemu had 2
> processors at 100% on the host. I removed counters where both columns have 0
> value for brevity.
> 
>  exits                 45937979  758051
>  fpu_reload             1416831      87
>  halt_exits              112911       0
>  halt_wakeup              31771       0
>  host_state_reload      2068602     263
>  insn_emulation        21601480  365493
>  io_exits               1827374    2705
>  irq_exits              8934818  285196
>  mmio_exits              421674     147
>  mmu_cache_miss         4817689   93680
>  mmu_flooded            4815273   93680
>  mmu_pde_zapped           51344       0
>  mmu_prefetch           4817625   93680
>  mmu_pte_updated       14803298  270104
>  mmu_pte_write         19859863  363785
>  mmu_shadow_zapped      4832106   93679
>  pf_fixed              32184355  468398
>  pf_guest                264138       0
>  remote_tlb_flush      10697762  280522
>  tlb_flush             10301338  176424
> 
> (NOTE: This is for a *5* second sample interval instead of 1 to allow me to
> capture the data).
> 
> Here's a sample when the guest is "well-behaved" (system time <10%):
>  exits                 51502194   97453
>  fpu_reload             1421736     227
>  halt_exits              138361    1927
>  halt_wakeup              33047     117
>  host_state_reload      2110190    3740
>  insn_emulation        24367441   47260
>  io_exits               1874075    2576
>  irq_exits             10224702   13333
>  mmio_exits              435154    1726
>  mmu_cache_miss         5414097   11258
>  mmu_flooded            5411548   11243
>  mmu_pde_zapped           52851      44
>  mmu_prefetch           5414031   11258
>  mmu_pte_updated       16854686   29901
>  mmu_pte_write         22526765   42285
>  mmu_shadow_zapped      5430025   11313
>  pf_fixed              36144578   67666
>  pf_guest                282794     430
>  remote_tlb_flush      12126268   14619
>  tlb_flush             11753162   21460
> 
> 
> There is definitely a strong correlation between the mmu counters and high
> system times in the guest. I am still trying to find out what in the guest is
> stimulating it when running on RHEL3; I do not see this same behavior for an
> equivalent setup running on RHEL4.
> 
> By the way, I added an mmu_prefetch stat in prefetch_page() to count the number
> of times the for() loop is hit with PTTYPE == 64; i.e., the number of times
> paging64_prefetch_page() is invoked. (I wanted an explicit counter for this
> loop, though the info seems to duplicate other entries.) That counter is listed
> above. As I mentioned in a prior post, when kscand kicks in the mmu_prefetch
> counter climbs at 20,000+/sec, with each trip through that function
> taking 45k+ cycles.
> 
> kscand is an instigator shortly after boot; however, kscand is *not* the culprit
> once the system has been up for 30-45 minutes. I have started instrumenting the
> RHEL3U8 kernel, and for the load I am running kscand does not walk the active
> lists very often once the system is up.
> 
> So, to dig deeper into what in the guest is stimulating the mmu, I collected
> kvmtrace data for about a 2-minute interval, which caught a roughly 30-second
> period where guest system time was steady in the 25-30% range. Summarizing the
> number of times a RIP appears in a VMEXIT shows the following high runners:
> 
>   count      RIP       RHEL3-symbol
>   82549   0xc0140e42  follow_page [kernel] c0140d90 offset b2
>   42532   0xc0144760  handle_mm_fault [kernel] c01446d0 offset 90
>   36826   0xc013da4a  futex_wait [kernel] c013d870 offset 1da
>   29987   0xc0145cd0  zap_pte_range [kernel] c0145c10 offset c0
>   27451   0xc0144018  do_no_page [kernel] c0143e20 offset 1f8
> 
> (halt entry removed from the list since that is the ideal scenario for an exit).
> 
> So the RIP correlates to follow_page() for a large percentage of the VMEXITs.
> 
> I wrote an awk script to summarize (histogram style) the TSC cycles between
> VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, the trace
> shows a delta between 50k and 100k cycles from the VMEXIT to the subsequent
> VMENTRY 82,271 times (i.e., almost 100% of the occurrences). Similarly, for the
> second one, 0xc0144760, the trace shows a 50k-100k cycle delta 42,403 times
> (again almost 100% of the occurrences). These deltas seem to correlate with the
> prefetch_page function in kvm, though I am not 100% positive on that.
> 
> I am now investigating the kernel paths leading to those functions. Any insights
> would definitely be appreciated.
> 
> david
> 
> 
> Marcelo Tosatti wrote:
>> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>>> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>>>
>>>         for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>>                 gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>>                 pte_gpa += (i+offset) * sizeof(pt_element_t);
>>>
>>>                 r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>>                                           sizeof(pt_element_t));
>>>                 if (r || is_present_pte(pt))
>>>                         sp->spt[i] = shadow_trap_nonpresent_pte;
>>>                 else
>>>                         sp->spt[i] = shadow_notrap_nonpresent_pte;
>>>         }
>>>
>>> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
>>> loop.
>>>
>>> This function gets run >20,000/sec during some of the kscand loops.
>> Hi David,
>>
>> Do you see the mmu_recycled counter increase?
>>
> 

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30  4:18                       ` David S. Ahern
@ 2008-04-30  9:55                         ` Avi Kivity
  2008-04-30 13:39                           ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-30  9:55 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti

David S. Ahern wrote:
> Another tidbit for you guys as I make my way through various permutations:
> I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
> System time still has some regular hiccups that are higher than xen and esx
> (e.g., 1 minute samples out of 5 show system time between 10 and 15%), but
> overall guest behavior is good with the hugemem kernel.
>
>   

Wait, the amount of info here is overwhelming. Let's stick with the 
current kernel (32-bit, HIGHMEM4G, right?)

Did you get any traces with bypass_guest_pf=0? That may show more info.

-- 

Any sufficiently difficult bug is indistinguishable from a feature.


* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30  9:55                         ` Avi Kivity
@ 2008-04-30 13:39                           ` David S. Ahern
  2008-04-30 13:49                             ` Avi Kivity
  2008-04-30 13:56                             ` Daniel P. Berrange
  0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 13:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti

Avi Kivity wrote:
> David S. Ahern wrote:
>> Another tidbit for you guys as I make my way through various
>> permutations:
>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>> better.
>> System time still has some regular hiccups that are higher than xen
>> and esx
>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>> but
>> overall guest behavior is good with the hugemem kernel.
>>
>>   
> 
> Wait, the amount of info here is overwhelming. Let's stick with the
> current kernel (32-bit, HIGHMEM4G, right?)
> 
> Did you get any traces with bypass_guest_pf=0? That may show more info.
> 

My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
My point in the last email was that the hugemem kernel shows a remarkable
difference (it uses 3 levels of page tables, right?). I was hoping that would
ring a bell with someone.

Adding bypass_guest_pf=0 did not improve the situation. Did you want anything
particular with that setting -- like a RIP summary or a summary of exit-entry
cycles?

david

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:39                           ` David S. Ahern
@ 2008-04-30 13:49                             ` Avi Kivity
  2008-05-11 12:32                               ` Avi Kivity
  2008-04-30 13:56                             ` Daniel P. Berrange
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-30 13:49 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti

David S. Ahern wrote:
> Avi Kivity wrote:
>   
>> David S. Ahern wrote:
>>     
>>> Another tidbit for you guys as I make my way through various
>>> permutations:
>>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>>> better.
>>> System time still has some regular hiccups that are higher than xen
>>> and esx
>>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>>> but
>>> overall guest behavior is good with the hugemem kernel.
>>>
>>>   
>>>       
>> Wait, the amount of info here is overwhelming. Let's stick with the
>> current kernel (32-bit, HIGHMEM4G, right?)
>>
>> Did you get any traces with bypass_guest_pf=0? That may show more info.
>>
>>     
>
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
>   

Me too. I would like to see all reasonable guests supported well, 
without performance issues, and not have to tell the user which kernel to 
use.

> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3 levels of page tables, right?). I was hoping that would
> ring a bell with someone.
>   

From the traces I saw, I think the standard kernel is PAE as well.  Can 
you verify?  I think it's CONFIG_HIGHMEM4G (instead of 
CONFIG_HIGHMEM64G), but that option may be different for such an old kernel.

> Adding bypass_guest_pf=0 did not improve the situation. Did you want anything
> particular with that setting -- like a RIP summary or a summary of exit-entry
> cycles?
>   

I asked for this thinking bypass_guest_pf might help show more 
information.  But thinking about it a bit more, it will not.

I think I do know what the problem is.  I will try it out.  Is there a 
free clone (like centos) available somewhere?

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:39                           ` David S. Ahern
  2008-04-30 13:49                             ` Avi Kivity
@ 2008-04-30 13:56                             ` Daniel P. Berrange
  2008-04-30 14:23                               ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Daniel P. Berrange @ 2008-04-30 13:56 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti, Avi Kivity

On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
> Avi Kivity wrote:
> > David S. Ahern wrote:
> >> Another tidbit for you guys as I make my way through various
> >> permutations:
> >> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
> >> better.
> >> System time still has some regular hiccups that are higher than xen
> >> and esx
> >> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
> >> but
> >> overall guest behavior is good with the hugemem kernel.
> >>
> >>   
> > 
> > Wait, the amount of info here is overwhelming. Let's stick with the
> > current kernel (32-bit, HIGHMEM4G, right?)
> > 
> > Did you get any traces with bypass_guest_pf=0? That may show more info.
> > 
> 
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3-levels of page tables right?). I was hoping that would
> ring a bell with someone.

IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
give userspace and kernelspace their own independent pagetables:

  http://lwn.net/Articles/39925/
  http://lwn.net/Articles/39283/

Dan.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:56                             ` Daniel P. Berrange
@ 2008-04-30 14:23                               ` David S. Ahern
  0 siblings, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 14:23 UTC (permalink / raw)
  To: Daniel P. Berrange, Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti

Yes, the 4G/4G patch and the 64G options are both enabled for the hugemem kernel:

CONFIG_HIGHMEM64G=y
CONFIG_X86_4G=y


Differences between the "standard" kernel and the hugemem kernel:

# diff config-2.4.21-47.ELsmp config-2.4.21-47.ELhugemem
2157,2158c2157,2158
< CONFIG_M686=y
< # CONFIG_MPENTIUMIII is not set
---
> # CONFIG_M686 is not set
> CONFIG_MPENTIUMIII=y
2169c2169
< CONFIG_X86_PGE=y
---
> # CONFIG_X86_PGE is not set
2193c2193
< # CONFIG_X86_4G is not set
---
> CONFIG_X86_4G=y
2365,2366c2365
< CONFIG_M686=y
< CONFIG_X86_PGE=y
---
> CONFIG_MPENTIUMIII=y
2369,2372d2367
< # CONFIG_MXT is not set
< CONFIG_HOTPLUG_PCI=y
< CONFIG_HOTPLUG_PCI_COMPAQ=m
< CONFIG_HOTPLUG_PCI_IBM=m
2373a2369
> CONFIG_X86_4G=y
2377,2379d2372
< # CONFIG_EWRK3 is not set
< CONFIG_UNIX98_PTY_COUNT=2048
< CONFIG_HZ=512
2382a2376,2383
> # CONFIG_MXT is not set
> CONFIG_HOTPLUG_PCI=y
> CONFIG_HOTPLUG_PCI_COMPAQ=m
> CONFIG_HOTPLUG_PCI_IBM=m
> # CONFIG_EWRK3 is not set
> CONFIG_UNIX98_PTY_COUNT=2048
> CONFIG_DEBUG_BUGVERBOSE=y
> # CONFIG_PNPBIOS is not set


Avi:

Centos releases:

http://isoredirect.centos.org/centos/3/isos/i386/

I am running RHEL3.8 which I do not see listed. Also, I'll need to work on a
stock install and try to capture some kind of workload that exhibits the
problem. It will be a couple of days.

david


Daniel P. Berrange wrote:
> On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
>> Avi Kivity wrote:
>>> David S. Ahern wrote:
>>>> Another tidbit for you guys as I make my way through various
>>>> permutations:
>>>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>>>> better.
>>>> System time still has some regular hiccups that are higher than xen
>>>> and esx
>>>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>>>> but
>>>> overall guest behavior is good with the hugemem kernel.
>>>>
>>>>   
>>> Wait, the amount of info here is overwhelming. Let's stick with the
>>> current kernel (32-bit, HIGHMEM4G, right?)
>>>
>>> Did you get any traces with bypass_guest_pf=0? That may show more info.
>>>
>> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
>> My point in the last email was that the hugemem kernel shows a remarkable
>> difference (it uses 3-levels of page tables right?). I was hoping that would
>> ring a bell with someone.
> 
> IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
> give userspace and kernelspace their own independent pagetables:
> 
>   http://lwn.net/Articles/39925/
>   http://lwn.net/Articles/39283/
> 
> Dan.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:49                             ` Avi Kivity
@ 2008-05-11 12:32                               ` Avi Kivity
  2008-05-11 13:36                                 ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-11 12:32 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 602 bytes --]

Avi Kivity wrote:
>
> I asked for this thinking bypass_guest_pf may help show more 
> information.  But thinking a bit more, it will not.
>
> I think I do know what the problem is.  I will try it out.  Is there a 
> free clone (like centos) available somewhere?

This patch tracks down emulated accesses to speculated ptes and marks 
them as accessed, preventing the flooding on centos-3.1.  Unfortunately 
it also causes a host oops midway through the boot process.

I believe the oops is merely exposed by the patch, not caused by it.

-- 
error compiling committee.c: too many arguments to function


[-- Attachment #2: prevent-kscand-flooding.patch --]
[-- Type: text/x-patch, Size: 2435 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..8c1e7f3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1127,8 +1127,10 @@ unshadowed:
 		else
 			kvm_release_pfn_clean(pfn);
 	}
-	if (!ptwrite || !*ptwrite)
+	if (speculative) {
 		vcpu->arch.last_pte_updated = shadow_pte;
+		vcpu->arch.last_pte_gfn = gfn;
+	}
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1674,6 +1676,17 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	vcpu->arch.update_pte.pfn = pfn;
 }
 
+static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u64 *spte = vcpu->arch.last_pte_updated;
+
+	if (spte
+	    && vcpu->arch.last_pte_gfn == gfn
+	    && shadow_accessed_mask
+	    && !(*spte & shadow_accessed_mask))
+		set_bit(PT_ACCESSED_SHIFT, spte);
+}
+
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes)
 {
@@ -1697,13 +1710,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 	mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
 	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_access_page(vcpu, gfn);
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	kvm_mmu_audit(vcpu, "pre pte write");
 	if (gfn == vcpu->arch.last_pt_write_gfn
 	    && !last_updated_pte_accessed(vcpu)) {
 		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
+		if (vcpu->arch.last_pt_write_count >= 4)
 			flooded = 1;
 	} else {
 		vcpu->arch.last_pt_write_gfn = gfn;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1730757..258e5d5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -15,7 +15,8 @@
 #define PT_USER_MASK (1ULL << 2)
 #define PT_PWT_MASK (1ULL << 3)
 #define PT_PCD_MASK (1ULL << 4)
-#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_ACCESSED_SHIFT 5
+#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
 #define PT_DIRTY_MASK (1ULL << 6)
 #define PT_PAGE_SIZE_MASK (1ULL << 7)
 #define PT_PAT_MASK (1ULL << 7)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1d8cd01..0bdb392 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -242,6 +242,7 @@ struct kvm_vcpu_arch {
 	gfn_t last_pt_write_gfn;
 	int   last_pt_write_count;
 	u64  *last_pte_updated;
+	gfn_t last_pte_gfn;
 
 	struct {
 		gfn_t gfn;	/* presumed gfn during guest pte update */


[-- Attachment #4: Type: text/plain, Size: 158 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-11 12:32                               ` Avi Kivity
@ 2008-05-11 13:36                                 ` Avi Kivity
  2008-05-13  3:49                                   ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-11 13:36 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> I asked for this thinking bypass_guest_pf may help show more 
>> information.  But thinking a bit more, it will not.
>>
>> I think I do know what the problem is.  I will try it out.  Is there 
>> a free clone (like centos) available somewhere?
>
> This patch tracks down emulated accesses to speculated ptes and marks 
> them as accessed, preventing the flooding on centos-3.1.  
> Unfortunately it also causes a host oops midway through the boot process.
>
> I believe the oops is merely exposed by the patch, not caused by it.
>

It was caused by the patch, please try the updated one attached.

-- 
error compiling committee.c: too many arguments to function


[-- Attachment #2: prevent-kscand-flooding.patch --]
[-- Type: text/x-patch, Size: 2473 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..012e8ad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1127,8 +1127,10 @@ unshadowed:
 		else
 			kvm_release_pfn_clean(pfn);
 	}
-	if (!ptwrite || !*ptwrite)
+	if (speculative) {
 		vcpu->arch.last_pte_updated = shadow_pte;
+		vcpu->arch.last_pte_gfn = gfn;
+	}
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1674,6 +1676,18 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	vcpu->arch.update_pte.pfn = pfn;
 }
 
+static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u64 *spte = vcpu->arch.last_pte_updated;
+
+	if (spte
+	    && vcpu->arch.last_pte_gfn == gfn
+	    && shadow_accessed_mask
+	    && !(*spte & shadow_accessed_mask)
+	    && is_shadow_present_pte(*spte))
+		set_bit(PT_ACCESSED_SHIFT, spte);
+}
+
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes)
 {
@@ -1697,13 +1711,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 	mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
 	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_access_page(vcpu, gfn);
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	kvm_mmu_audit(vcpu, "pre pte write");
 	if (gfn == vcpu->arch.last_pt_write_gfn
 	    && !last_updated_pte_accessed(vcpu)) {
 		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
+		if (vcpu->arch.last_pt_write_count >= 5)
 			flooded = 1;
 	} else {
 		vcpu->arch.last_pt_write_gfn = gfn;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1730757..258e5d5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -15,7 +15,8 @@
 #define PT_USER_MASK (1ULL << 2)
 #define PT_PWT_MASK (1ULL << 3)
 #define PT_PCD_MASK (1ULL << 4)
-#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_ACCESSED_SHIFT 5
+#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
 #define PT_DIRTY_MASK (1ULL << 6)
 #define PT_PAGE_SIZE_MASK (1ULL << 7)
 #define PT_PAT_MASK (1ULL << 7)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1d8cd01..0bdb392 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -242,6 +242,7 @@ struct kvm_vcpu_arch {
 	gfn_t last_pt_write_gfn;
 	int   last_pt_write_count;
 	u64  *last_pte_updated;
+	gfn_t last_pte_gfn;
 
 	struct {
 		gfn_t gfn;	/* presumed gfn during guest pte update */


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-11 13:36                                 ` Avi Kivity
@ 2008-05-13  3:49                                   ` David S. Ahern
  2008-05-13  7:25                                     ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-13  3:49 UTC (permalink / raw)
  To: Avi Kivity, kvm-devel

That does the trick with kscand.

Do you have recommendations for clock source settings? For example in my
test case for this patch the guest gained 73 seconds (ahead of real
time) after only 3 hours, 5 min of uptime.

thanks,

david


Avi Kivity wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> I asked for this thinking bypass_guest_pf may help show more
>>> information.  But thinking a bit more, it will not.
>>>
>>> I think I do know what the problem is.  I will try it out.  Is there
>>> a free clone (like centos) available somewhere?
>>
>> This patch tracks down emulated accesses to speculated ptes and marks
>> them as accessed, preventing the flooding on centos-3.1. 
>> Unfortunately it also causes a host oops midway through the boot process.
>>
>> I believe the oops is merely exposed by the patch, not caused by it.
>>
> 
> It was caused by the patch, please try the updated one attached.
> 

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-13  3:49                                   ` David S. Ahern
@ 2008-05-13  7:25                                     ` Avi Kivity
  2008-05-14 20:35                                       ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-13  7:25 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> That does the trick with kscand.
>
>   

Not so fast...  the patch updates the flood count to 5.  Can you check 
if a lower value still works?  Also, whether updating the flood count to 
5 (without the rest of the patch) works?

Unconditionally bumping the flood count to 5 will likely cause a 
performance regression on other guests.

While I was able to see excessive flooding, I couldn't reproduce your 
kscand problem.  Running /bin/true always returned immediately for me.

> Do you have recommendations for clock source settings? For example in my
> test case for this patch the guest gained 73 seconds (ahead of real
> time) after only 3 hours, 5 min of uptime.
>   

The kernel is trying to correlate tsc and pit, which isn't going to work.

Try disabling the tsc, set edx.bit4=0 for cpuid.eax=1 in qemu-kvm-x86.c 
do_cpuid_ent().

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-13  7:25                                     ` Avi Kivity
@ 2008-05-14 20:35                                       ` David S. Ahern
  2008-05-15 10:53                                         ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-14 20:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

Avi Kivity wrote:
> Not so fast...  the patch updates the flood count to 5.  Can you check
> if a lower value still works?  Also, whether updating the flood count to
> 5 (without the rest of the patch) works?
> 
> Unconditionally bumping the flood count to 5 will likely cause a
> performance regression on other guests.

I put the flood count back to 3, and the RHEL3 guest performance is even
better.

> 
> While I was able to see excessive flooding, I couldn't reproduce your
> kscand problem.  Running /bin/true always returned immediately for me.

A poor attempt at finding a simplistic, minimal re-create. The use case
I am investigating has over 500 processes/threads with a base memory
consumption around 1GB. I was finding it nearly impossible to have a
generic re-create of the problem for you to use in your investigations
on CentOS.

Thanks for the patch.

david


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-14 20:35                                       ` David S. Ahern
@ 2008-05-15 10:53                                         ` Avi Kivity
  2008-05-17  4:31                                           ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-15 10:53 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> Avi Kivity wrote:
>   
>> Not so fast...  the patch updates the flood count to 5.  Can you check
>> if a lower value still works?  Also, whether updating the flood count to
>> 5 (without the rest of the patch) works?
>>
>> Unconditionally bumping the flood count to 5 will likely cause a
>> performance regression on other guests.
>>     
>
> I put the flood count back to 3, and the RHEL3 guest performance is even
> better.
>
>   

Okay, I committed the patch without the flood count == 5.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-15 10:53                                         ` Avi Kivity
@ 2008-05-17  4:31                                           ` David S. Ahern
       [not found]                                             ` <482FCEE1.5040306@qumranet.com>
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-17  4:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

[-- Attachment #1: Type: text/plain, Size: 1092 bytes --]


Avi Kivity wrote:
> 
> Okay, I committed the patch without the flood count == 5.
> 

I've continued testing the RHEL3 guests with the flood count at 3, and I
am right back to where I started. With the patch and the flood count at
3, I had 2 runs totaling around 24 hours that looked really good. Now, I
am back to square one. I guess the short of it is that I am not sure if
the patch resolves this issue or not.

If you want to back it out, I can continue to apply on my end as I
continue testing. A snapshot of kvm_stat -f 'mmu*' -l is attached for
two test runs with the patch (line wrap is horrible inline).

I will work on creating an app that will stimulate kscand activity
similar to what I am seeing.


Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
noticed that with the -no-kvm-pit option the guest time is much better
and typically stays within 3 seconds or so of the host, even through the
high kscand activity which is one instance of when I've noticed time
jumps with the kernel pit. Yes, this result has been repeatable through
6 or so runs. :-)

david

[-- Attachment #2: kvm-stats-rhel3 --]
[-- Type: text/plain, Size: 4102 bytes --]

kvm-68 with Avi's patch and flood threshold at 3:


mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  mmu_pte_w  mmu_recyc  mmu_shado
       175        880        880          0       1832       2714          0        880
        35        868        868          0       1782       2650          0        868
        91       8522       8520        131      29179      38651          0       8722
        28        991        992          0       2314       3312          0        992
        91        796        796          0       1648       2445          0        796
        81       1944       1943          0       7241       9213          0       1943
        98       4149       4148         31      11975      16196          0       4214
        41       3379       3380          0       9710      13100          0       3378
        42      17729      17730          0      48415      66152          0      17729

guest has an apparent lockup at this point; when it unfreezes, kscand cpu
time jumps by roughly the length of time the command line was frozen (on the
order of 30 seconds or more)

        14      18634      18633          0      48286      66921          0      18634
        21      18607      18607          0      48395      67001          0      18607
        91      17991      17991          0      50039      68040          0      17991
         7      17919      17920          0      53731      71650          0      17919
         7      18060      18060          0      53539      71599          0      18060
        21      17755      17755          0      52714      70469          0      17755


-----------------------

with Avi's patch and flood threshold at 5.

 mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  mmu_pte_w  mmu_recyc  mmu_shado
       147        604        602         42      21299      21957          0        660
       112        163        167         23       7567       7759          0        170
       105          0          1          2       3378       3381          0          1
        14          4          4          0       9685       9689          0          4
       137        628        623         43      21557      22255          0        682
        42          0          2          4       5834       5840          0          2
        91         14         16          0      25741      25757          0         16
        28         58         55          0      23571      23626          0         55
        84        627        624         45      32588      33268          0        685
       132          9         13          1      12162      12177          0         13
        91          0          1          0       3422       3423          0          1
        35          1          1          0       4624       4625          0          1
       102        237        244          0      12257      12504          0        242
        19        401        387         46      20643      21088          0        449
        26          3          4          1     127252     127261          0          4


guest has an apparent lockup at this point; when it unfreezes, kscand cpu
time jumps by roughly the length of time the command line was frozen (on the
order of 30 seconds or more)


        21          0          0          0     182651     182651          0          0
        14          0          0          0     182524     182523          0          0
       178          4          5          4     170752     170759          0          5
        35          0          0          0     181471     181473          0          0
        21          0          0          0     182263     182263          0          0
        14          0          0          0     182493     182494          0          0
        21          0          0          0     182489     182488          0          0
        91          0          0          0     182203     182204          0          0
        35          0          0          0     182378     182377          0          0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
       [not found]                                               ` <4830F90A.1020809@cisco.com>
@ 2008-05-19  4:14                                                 ` David S. Ahern
  2008-05-19 14:27                                                   ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-19  4:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

[resend to new list].


David S. Ahern wrote:
> I was just digging through the sysstat history files, and I was not
> imagining it: I did have an excellent overnight run on 5/13-5/14 with
> your patch and the standard RHEL3U8 smp kernel in the guest. I have no
> idea why I cannot get anywhere close to that again. I have updated quite
> a few variables since then (such as going from 2.6.25-rc8 to 2.6.25.3
> kernel in the host), but backing them out (i.e., resetting the test to
> my recollection of all the details of 5/14) has not helped. baffling and
> frustrating.
> 
> more in-line below.
> 
> 
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> Avi Kivity wrote:
>>>  
>>>> Okay, I committed the patch without the flood count == 5.
>>>>
>>>>     
>>> I've continued testing the RHEL3 guests with the flood count at 3, and I
>>> am right back to where I started. With the patch and the flood count at
>>> 3, I had 2 runs totaling around 24 hours that looked really good. Now, I
>>> am back to square one. I guess the short of it is that I am not sure if
>>> the patch resolves this issue or not.
>>>
>>>   
>> What about with the flood count at 5?  Does it reliably improve
>> performance?
>>
> 
> [dsa] No. I saw the same problem with the flood count at 5. The
> attachment in the last email shows kvm_stat data during a kscand event.
> The data was collected with the patch you posted. With the flood count
> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
> mmu_cache/flood drops to 0 and pte updates and writes both hit
> 180,000+/second. In both cases these last for 30 seconds or more. I only
> included data for the onset as it's pretty flat during the kscand activity.
> 
>>> Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
>>> noticed that with the -no-kvm-pit option the guest time is much better
>>> and typically stays within 3 seconds or so of the host, even through the
>>> high kscand activity which is one instance of when I've noticed time
>>> jumps with the kernel pit. Yes, this result has been repeatable through
>>> 6 or so runs. :-)
>>>   
>> Strange.  The in-kernel PIT was supposed to improve accuracy.
>>
> 
> [dsa] I started a run with the RHEL4 guest 8 hours ago and it is showing
> the same kind of success. With the in-kernel PIT, time in the guest
> advanced ~120 seconds over real time after just 2 days of up time. With
> the userspace PIT, time in the guest is behind real time by only 1
> second after 8 hours of uptime. Note that I am running the RHEL4.6
> kernel recompiled with HZ at 250 instead of the usual 1000.
> 
> david
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-19  4:14                                                 ` [kvm-devel] " David S. Ahern
@ 2008-05-19 14:27                                                   ` Avi Kivity
  2008-05-19 16:25                                                     ` David S. Ahern
  2008-05-20 14:19                                                     ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-19 14:27 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
>> [dsa] No. I saw the same problem with the flood count at 5. The
>> attachment in the last email shows kvm_stat data during a kscand event.
>> The data was collected with the patch you posted. With the flood count
>> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
>> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
>> mmu_cache/flood drops to 0 and pte updates and writes both hit
>> 180,000+/second. In both cases these last for 30 seconds or more. I only
>> included data for the onset as it's pretty flat during the kscand activity.
>>     

It makes sense.  We removed a flooding false positive, and introduced a 
false negative.

The guest access sequence is:
- point kmap pte at page table
- use the new pte to access the page table

Prior to the patch, the mmu didn't see the 'use' part, so it concluded 
the kmap pte would be better off unshadowed.  This shows up as a high 
flood count.

After the patch, this no longer happens, so the sequence can repeat for 
long periods.  However the pte that is the result of the 'use' part is 
never accessed, so it should be detected as flooded!  But our flood 
detection mechanism looks at one page at a time (per vcpu), while there 
are two pages involved here.

There are (at least) three options available:
- detect and special-case this scenario
- change the flood detector to be per page table instead of per vcpu
- change the flood detector to look at a list of recently used page 
tables instead of the last page table

I'm having a hard time trying to pick between the second and third options.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-19 14:27                                                   ` Avi Kivity
@ 2008-05-19 16:25                                                     ` David S. Ahern
  2008-05-19 17:04                                                       ` Avi Kivity
  2008-05-20 14:19                                                     ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-19 16:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Does the fact that the hugemem kernel works just fine have any bearing
on your options? Or rather, is there something unique about the way
kscand works in the hugemem kernel that makes its performance ok?

I mentioned last month (so without your first patch) that running the
hugemem kernel showed a remarkable improvement in performance compared
to the standard smp kernel. Over the weekend I ran a test with your
first patch and with the flood detector at 3 (I have not run a case with
the detector at 5) and performance with the hugemem was even better in
the sense that 1-minute averages of guest system time show no noticeable
spikes.

In an earlier post I showed a diff in the config files for the standard
SMP and hugemem kernels. See:
http://article.gmane.org/gmane.comp.emulators.kvm.devel/16944/

david



Avi Kivity wrote:
> David S. Ahern wrote:
>>> [dsa] No. I saw the same problem with the flood count at 5. The
>>> attachment in the last email shows kvm_stat data during a kscand event.
>>> The data was collected with the patch you posted. With the flood count
>>> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
>>> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
>>> mmu_cache/flood drops to 0 and pte updates and writes both hit
>>> 180,000+/second. In both cases these last for 30 seconds or more. I only
>>> included data for the onset as it's pretty flat during the kscand
>>> activity.
>>>     
> 
> It makes sense.  We removed a flooding false positive, and introduced a
> false negative.
> 
> The guest access sequence is:
> - point kmap pte at page table
> - use the new pte to access the page table
> 
> Prior to the patch, the mmu didn't see the 'use' part, so it concluded
> the kmap pte would be better off unshadowed.  This shows up as a high
> flood count.
> 
> After the patch, this no longer happens, so the sequence can repeat for
> long periods.  However the pte that is the result of the 'use' part is
> never accessed, so it should be detected as flooded!  But our flood
> detection mechanism looks at one page at a time (per vcpu), while there
> are two pages involved here.
> 
> There are (at least) three options available:
> - detect and special-case this scenario
> - change the flood detector to be per page table instead of per vcpu
> - change the flood detector to look at a list of recently used page
> tables instead of the last page table
> 
> I'm having a hard time trying to pick between the second and third options.
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-19 16:25                                                     ` David S. Ahern
@ 2008-05-19 17:04                                                       ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-19 17:04 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
> Does the fact that the hugemem kernel works just fine have any bearing
> on your options? Or rather, is there something unique about the way
> kscand works in the hugemem kernel that its performance is ok?
>
>   

Yes.  If your guest has < 4GB of memory, then all of it is lowmem in the 
hugemem kernel, and the two-step process for modifying a pte is 
short-circuited into just one step, and everything works fine.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-19 14:27                                                   ` Avi Kivity
  2008-05-19 16:25                                                     ` David S. Ahern
@ 2008-05-20 14:19                                                     ` Avi Kivity
  2008-05-20 14:34                                                       ` Avi Kivity
  2008-05-22 22:08                                                       ` David S. Ahern
  1 sibling, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-20 14:19 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

Avi Kivity wrote:
>
> There are (at least) three options available:
> - detect and special-case this scenario
> - change the flood detector to be per page table instead of per vcpu
> - change the flood detector to look at a list of recently used page 
> tables instead of the last page table
>
> I'm having a hard time trying to pick between the second and third 
> options.
>

The answer turns out to be "yes", so here's a patch that adds a pte 
access history table for each shadowed guest page-table.  Let me know if 
it helps.  Benchmarking a variety of workloads on all guests supported 
by kvm is left as an exercise for the reader, but I suspect the patch 
will either improve things all around, or can be modified to do so.

-- 
error compiling committee.c: too many arguments to function


[-- Attachment #2: per-page-pte-history.patch --]
[-- Type: text/x-patch, Size: 4637 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 154727d..1a3d01a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1130,7 +1130,8 @@ unshadowed:
 	if (speculative) {
 		vcpu->arch.last_pte_updated = shadow_pte;
 		vcpu->arch.last_pte_gfn = gfn;
-	}
+	} else
+		page_header(__pa(shadow_pte))->pte_history_len = 0;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1616,13 +1617,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 					  const u8 *new, int bytes)
 {
@@ -1679,13 +1673,49 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	u64 *spte = vcpu->arch.last_pte_updated;
+	struct kvm_mmu_page *page;
+
+	if (spte && vcpu->arch.last_pte_gfn == gfn) {
+		page = page_header(__pa(spte));
+		page->pte_history_len = 0;
+		pgprintk("clearing page history, gfn %x ent %lx\n",
+			 page->gfn, spte - page->spt);
+	}
+}
+
+static bool kvm_mmu_page_flooded(struct kvm_mmu_page *page)
+{
+	int i, j, ent, len;
 
-	if (spte
-	    && vcpu->arch.last_pte_gfn == gfn
-	    && shadow_accessed_mask
-	    && !(*spte & shadow_accessed_mask)
-	    && is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, spte);
+	len = page->pte_history_len;
+	for (i = len; i != 0; --i) {
+		ent = page->pte_history[i - 1];
+		if (test_bit(PT_ACCESSED_SHIFT, &page->spt[ent])) {
+			for (j = i; j < len; ++j)
+				page->pte_history[j-i] = page->pte_history[j];
+			page->pte_history_len = len - i;
+			return false;
+		}
+	}
+	if (page->pte_history_len < KVM_MAX_PTE_HISTORY)
+		return false;
+	return true;
+}
+
+static void kvm_mmu_log_pte_history(struct kvm_mmu_page *page, u64 *spte)
+{
+	int i;
+	unsigned ent = spte - page->spt;
+
+	if (page->pte_history_len > 0
+	    && page->pte_history[page->pte_history_len - 1] == ent)
+		return;
+	if (page->pte_history_len == KVM_MAX_PTE_HISTORY) {
+		for (i = 1; i < KVM_MAX_PTE_HISTORY; ++i)
+			page->pte_history[i-1] = page->pte_history[i];
+		--page->pte_history_len;
+	}
+	page->pte_history[page->pte_history_len++] = ent;
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1704,7 +1734,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	unsigned misaligned;
 	unsigned quadrant;
 	int level;
-	int flooded = 0;
 	int npte;
 	int r;
 
@@ -1715,16 +1744,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	kvm_mmu_audit(vcpu, "pre pte write");
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = 1;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
 	index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 	hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
@@ -1733,7 +1752,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		pte_size = sp->role.glevels == PT32_ROOT_LEVEL ? 4 : 8;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
 		misaligned |= bytes < 4;
-		if (misaligned || flooded) {
+		if (misaligned || kvm_mmu_page_flooded(sp)) {
 			/*
 			 * Misaligned accesses are too much trouble to fix
 			 * up; also, they usually indicate a page is not used
@@ -1785,6 +1804,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			mmu_pte_write_zap_pte(vcpu, sp, spte);
 			if (new)
 				mmu_pte_write_new_pte(vcpu, sp, spte, new);
+			kvm_mmu_log_pte_history(sp, spte);
 			mmu_pte_write_flush_tlb(vcpu, entry, *spte);
 			++spte;
 		}
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index a71f3aa..cbe550e 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -78,6 +78,7 @@
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 40
+#define KVM_MAX_PTE_HISTORY 4
 
 extern spinlock_t kvm_lock;
 extern struct list_head vm_list;
@@ -189,6 +190,9 @@ struct kvm_mmu_page {
 		u64 *parent_pte;               /* !multimapped */
 		struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
 	};
+
+	u16 pte_history_len;
+	u16 pte_history[KVM_MAX_PTE_HISTORY];
 };
 
 /*

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-20 14:19                                                     ` Avi Kivity
@ 2008-05-20 14:34                                                       ` Avi Kivity
  2008-05-22 22:08                                                       ` David S. Ahern
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-20 14:34 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

Avi Kivity wrote:
>
> The answer turns out to be "yes", so here's a patch that adds a pte 
> access history table for each shadowed guest page-table.  Let me know 
> if it helps.  Benchmarking a variety of workloads on all guests 
> supported by kvm is left as an exercise for the reader, but I suspect 
> the patch will either improve things all around, or can be modified to 
> do so.
>

btw, the patch applies on top of kvm HEAD (which includes the previous 
patch).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-20 14:19                                                     ` Avi Kivity
  2008-05-20 14:34                                                       ` Avi Kivity
@ 2008-05-22 22:08                                                       ` David S. Ahern
  2008-05-28 10:51                                                         ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-22 22:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

The short answer is that I am still seeing large system time hiccups in the
guests due to kscand in the guest scanning its active lists. I do see
better response with a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
completeness I also tried a history of 2, but it performed worse than 3,
which is no surprise given its meaning.)


I have been able to scratch out a simplistic program that stimulates
kscand activity similar to what is going on in my real guest (see
attached). The program requests a memory allocation, initializes it (to
get it backed) and then in a loop sweeps through the memory in chunks
similar to a program using parts of its memory here and there but
eventually accessing all of it.

Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
using a fair amount of highmem. Start a couple of instances of the
attached. For example, I've been using these 2:

	memuser 768M 120 5 300
	memuser 384M 300 10 600

Together these instances take up 1GB of RAM and once initialized
consume very little CPU. On kvm they make kscand and kswapd go nuts
every 5-15 minutes. For comparison, I do not see the same behavior for
an identical setup running on esx 3.5.

david



Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> There are (at least) three options available:
>> - detect and special-case this scenario
>> - change the flood detector to be per page table instead of per vcpu
>> - change the flood detector to look at a list of recently used page
>> tables instead of the last page table
>>
>> I'm having a hard time trying to pick between the second and third
>> options.
>>
> 
> The answer turns out to be "yes", so here's a patch that adds a pte
> access history table for each shadowed guest page-table.  Let me know if
> it helps.  Benchmarking a variety of workloads on all guests supported
> by kvm is left as an exercise for the reader, but I suspect the patch
> will either improve things all around, or can be modified to do so.
> 

[-- Attachment #2: memuser.c --]
[-- Type: text/x-csrc, Size: 2621 bytes --]

/* simple program to malloc memory, initialize it, and
 * then repetitively use it to keep it active.
 */

#include <sys/time.h>
#include <time.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libgen.h>

/* goal is to sweep memory every T1 sec by accessing a
 * percentage at a time and sleeping T2 sec in between accesses.
 * Once all the memory has been accessed, sleep for T3 sec
 * before starting the cycle over.
 */
#define T1  180
#define T2  5
#define T3  300


const char *timestamp(void);

void usage(const char *prog) {
	fprintf(stderr, "\nusage: %s memlen{M|K} [t1 t2 t3]\n", prog);
}


int main(int argc, char *argv[])
{
	int len;
	char *endp;
	int factor, endp_len;
	int start, incr;
	int t1 = T1, t2 = T2, t3 = T3;
	char *mem;
	char c = 0;

	if (argc < 2) {
		usage(basename(argv[0]));
		return 1;
	}


	/*
	 * determine memory to request
	 */
	len = (int) strtol(argv[1], &endp, 0);
	factor = 1;
	endp_len = strlen(endp);
	if ((endp_len == 1) && ((*endp == 'M') || (*endp == 'm')))
		factor = 1024 * 1024;
	else if ((endp_len == 1) && ((*endp == 'K') || (*endp == 'k')))
		factor = 1024;
	else if (endp_len) {
		fprintf(stderr, "invalid memory len.\n");
		return 1;
	}
	len *= factor;

	if (len == 0) {
		fprintf(stdout, "memory len is 0.\n");
		return 1;
	}


	/*
	 * convert times if given
	 */
	if (argc > 2) {
		if (argc < 5) {
			usage(basename(argv[0]));
			return 1;
		}

		t1 = atoi(argv[2]);
		t2 = atoi(argv[3]);
		t3 = atoi(argv[4]);
	}



	/*
	 *  amount of memory to sweep at one time
	 */
	if (t1 && t2)
		incr = len / t1 * t2;
	else
		incr = len;



	mem = (char *) malloc(len);
	if (mem == NULL) {
		fprintf(stderr, "malloc failed\n");
		return 1;
	}
	printf("memory allocated. initializing to 0\n");
	memset(mem, 0, len);

	start = 0;
	printf("%s starting memory update.\n", timestamp());
	while (1) {
		c++;
		if (c == 0x7f) c = 0;
		memset(mem + start, c, incr);
		start += incr;

		if ((start >= len) || ((start + incr) >= len)) {
			printf("%s scan complete. sleeping %d\n", 
                              timestamp(), t3);
			start = 0;
			sleep(t3);
			printf("%s starting memory update.\n", timestamp());
		} else if (t2)
			sleep(t2);
	}

	return 0;
}

const char *timestamp(void)
{
    static char date[64];
    struct timeval now;
    struct tm ltime;

    memset(date, 0, sizeof(date));

    if (gettimeofday(&now, NULL) == 0)
    {
        if (localtime_r(&now.tv_sec, &ltime))
            strftime(date, sizeof(date), "%m/%d %H:%M:%S", &ltime);
    }

    if (strlen(date) == 0)
        strcpy(date, "unknown");

    return date;
}

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-22 22:08                                                       ` David S. Ahern
@ 2008-05-28 10:51                                                         ` Avi Kivity
  2008-05-28 14:13                                                           ` David S. Ahern
  2008-05-29 16:42                                                           ` David S. Ahern
  0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 10:51 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
> The short answer is that I am still seeing large system time hiccups in the
> guests due to kscand in the guest scanning its active lists. I do see
> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
> completeness I also tried a history of 2, but it performed worse than 3
> which is no surprise given the meaning of it.)
>
>
> I have been able to scratch out a simplistic program that stimulates
> kscand activity similar to what is going on in my real guest (see
> attached). The program requests a memory allocation, initializes it (to
> get it backed) and then in a loop sweeps through the memory in chunks
> similar to a program using parts of its memory here and there but
> eventually accessing all of it.
>
> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
> using a fair amount of highmem. Start a couple of instances of the
> attached. For example, I've been using these 2:
>
> 	memuser 768M 120 5 300
> 	memuser 384M 300 10 600
>
> Together these instances take up 1GB of RAM and once initialized
> consume very little CPU. On kvm they make kscand and kswapd go nuts
> every 5-15 minutes. For comparison, I do not see the same behavior for
> an identical setup running on esx 3.5.
>   

I haven't been able to reproduce this:

> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
> 1 S root         7     1  1  75   0    -     0 schedu 10:07 ?        
> 00:00:26 [kscand]
> 0 S root      1464     1  1  75   0    - 196986 schedu 10:20 pts/0   
> 00:00:21 ./memuser 768M 120 5 300
> 0 S root      1465     1  0  75   0    - 98683 schedu 10:20 pts/0    
> 00:00:10 ./memuser 384M 300 10 600
> 0 S root      2148  1293  0  75   0    -   922 pipe_w 10:48 pts/0    
> 00:00:00 grep -E memuser|kscand

The workload has been running for about half an hour, and kswapd cpu 
usage doesn't seem significant.  This is a 2GB guest running with my 
patch ported to kvm.git HEAD.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 10:51                                                         ` Avi Kivity
@ 2008-05-28 14:13                                                           ` David S. Ahern
  2008-05-28 14:35                                                             ` Avi Kivity
  2008-05-28 14:48                                                             ` Andrea Arcangeli
  2008-05-29 16:42                                                           ` David S. Ahern
  1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 14:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Weird. Could it be something about the hosts?

I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.

I'll rebuild kvm-69 with your latest patch and try the test programs again.

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still seeing large system time hiccups in the
>> guests due to kscand in the guest scanning its active lists. I do see
>> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
>> completeness I also tried a history of 2, but it performed worse than 3
>> which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>>     memuser 768M 120 5 300
>>     memuser 384M 300 10 600
>>
>> Together these instances take up 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>   
> 
> I haven't been able to reproduce this:
> 
>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root         7     1  1  75   0    -     0 schedu 10:07 ?       
>> 00:00:26 [kscand]
>> 0 S root      1464     1  1  75   0    - 196986 schedu 10:20 pts/0  
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root      1465     1  0  75   0    - 98683 schedu 10:20 pts/0   
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root      2148  1293  0  75   0    -   922 pipe_w 10:48 pts/0   
>> 00:00:00 grep -E memuser|kscand
> 
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant.  This is a 2GB guest running with my
> patch ported to kvm.git HEAD.
> 
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:13                                                           ` David S. Ahern
@ 2008-05-28 14:35                                                             ` Avi Kivity
  2008-05-28 19:49                                                               ` David S. Ahern
  2008-05-28 14:48                                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 14:35 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
> Weird. Could it be something about the hosts?
>
> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
>
> I'll rebuild kvm-69 with your latest patch and try the test programs again.
>   

I've pushed it into kvm.git, branch name per-page-pte-tracking.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:13                                                           ` David S. Ahern
  2008-05-28 14:35                                                             ` Avi Kivity
@ 2008-05-28 14:48                                                             ` Andrea Arcangeli
  2008-05-28 14:57                                                               ` Avi Kivity
  2008-05-28 15:37                                                               ` Avi Kivity
  1 sibling, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 14:48 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Avi Kivity, kvm

On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
> Weird. Could it be something about the hosts?

Note that the VM itself will never make use of kmap. The VM is "data"
agnostic: the VM never has any idea of the data contained by the
pages. kmap/kmap_atomic/kunmap_atomic are only needed to access _data_.

Only I/O (if not using DMA, or because of bounce buffers) and page
faults triggered in user process context, or other operations again
done from user process context will call into kmap or kmap_atomic.

And if KVM is inefficient in handling kmap/kmap_atomic, that will lead
to the user process running slower, and in turn generating less
pressure on the guest and host VM, if anything. The guest will run
slower than it should if KVM isn't optimized for the workload, but
that shouldn't alter any VM kernel thread's CPU usage; only the CPU
usage of the guest process context and the host system time in the
qemu task should go up, nothing else. This is again because the VM
never cares about the data contents and never invokes kmap/kmap_atomic.

So I never found a relation between the reported symptom of VM kernel
threads going weird and KVM's handling of kmap ptes.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:48                                                             ` Andrea Arcangeli
@ 2008-05-28 14:57                                                               ` Avi Kivity
  2008-05-28 15:39                                                                 ` David S. Ahern
  2008-05-28 15:58                                                                 ` Andrea Arcangeli
  2008-05-28 15:37                                                               ` Avi Kivity
  1 sibling, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 14:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
>   
>> Weird. Could it be something about the hosts?
>>     
>
> Note that the VM itself will never make use of kmap. The VM is "data"
> agnostic: the VM never has any idea of the data contained by the
> pages. kmap/kmap_atomic/kunmap_atomic are only needed to access _data_.
>
>   

What about CONFIG_HIGHPTE?



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:48                                                             ` Andrea Arcangeli
  2008-05-28 14:57                                                               ` Avi Kivity
@ 2008-05-28 15:37                                                               ` Avi Kivity
  2008-05-28 15:43                                                                 ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 15:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
>
> So I never found a relation to the symptom reported of VM kernel
> threads going weird, with KVM optimal handling of kmap ptes.
>   


The problem is this code:

static int scan_active_list(struct zone_struct * zone, int age,
                struct list_head * list)
{
        struct list_head *page_lru , *next;
        struct page * page;
        int over_rsslimit;

        /* Take the lock while messing with the list... */
        lru_lock(zone);
        list_for_each_safe(page_lru, next, list) {
                page = list_entry(page_lru, struct page, lru);
                pte_chain_lock(page);
                if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
                        age_page_up_nolock(page, age);
                pte_chain_unlock(page);
        }
        lru_unlock(zone);
        return 0;
}

If the pages in the list are in the same order as in the ptes (which is 
very likely), then we have the following access pattern

- set up kmap to point at pte
- test_and_clear_bit(pte)
- kunmap

 From kvm's point of view this looks like

- several accesses to set up the kmap
  - if these accesses trigger flooding, we will have to tear down the 
shadow for this page, only to set it up again soon
- an access to the pte (emulated)
  - if this access _doesn't_ trigger flooding, we will have 512 unneeded 
emulations.  The pte is worthless anyway since the accessed bit is clear 
(so we can't set up a shadow pte for it)
    - this bug was fixed
- an access to tear down the kmap

[btw, am I reading this right? the entire list is scanned each time?

if you have 1G of active HIGHMEM, that's a quarter of a million pages, 
which would take at least a second no matter what we do.  VMware can 
probably special-case kmaps, but we can't]

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:57                                                               ` Avi Kivity
@ 2008-05-28 15:39                                                                 ` David S. Ahern
  2008-05-29 11:49                                                                   ` Avi Kivity
  2008-05-29 12:10                                                                   ` Avi Kivity
  2008-05-28 15:58                                                                 ` Andrea Arcangeli
  1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 15:39 UTC (permalink / raw)
  To: Avi Kivity, Andrea Arcangeli; +Cc: kvm

I've been instrumenting the guest kernel as well. It's the scanning of
the active lists that triggers a lot of calls to paging64_prefetch_page,
which, as you guys know, correlates with the number of direct pages in the
list. Earlier in this thread I traced the kvm cycles to
paging64_prefetch_page(). See

http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html

In the guest I started capturing scans (kscand() loop) that took longer
than a jiffie. Here's an example for 1 trip through the active lists,
both anonymous and cache:

active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
36234, dj 225

active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3

active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
84829, dj 848

active_cache_scan: HighMem, age 12, count[age] 3397 -> 2640, direct 889,
dj 19

active_cache_scan: HighMem, age 8, count[age] 6105 -> 5884, direct 988,
dj 24

active_cache_scan: HighMem, age 4, count[age] 18923 -> 18400, direct
11141, dj 37

active_cache_scan: HighMem, age 0, count[age] 14283 -> 14283, direct 69,
dj 1


An explanation of the line (using the first one): it's a scan of the
anonymous list, age bucket of 4. Before the scan loop the bucket had
41863 pages and after the loop the bucket had 30194. Of the pages in the
bucket, 36234 were direct pages (i.e., PageDirect(page) was non-zero) and
for this bucket 225 jiffies passed while running scan_active_list().

On the host side the total times (sum of the dj's/100) in the output
above directly match spikes in the kvm_stat output (pte_writes/updates).

Tracing the RHEL3 code I believe linux-2.4.21-rmap.patch is the patch
that brought in the code that is run during the active list scans for
direct pages. In and of itself each trip through the while loop in
scan_active_list does not take a lot of time, but when run say 84,829
times (see age 0 above) the cumulative time is high, 8.48 seconds per
the example above.

I'll pull down the git branch and give it a spin.

david


Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
>>  
>>> Weird. Could it be something about the hosts?
>>>     
>>
>> Note that the VM itself will never make use of kmap. The VM is "data"
>> agnostic. The VM has never any idea with the data contained by the
>> pages. kmap/kmap_atomic/kunmap_atomic are only need to access _data_.
>>
>>   
> 
> What about CONFIG_HIGHPTE?
> 
> 
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 15:37                                                               ` Avi Kivity
@ 2008-05-28 15:43                                                                 ` David S. Ahern
  2008-05-28 17:04                                                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 15:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, kvm

This is the code in the RHEL3.8 kernel:

static int scan_active_list(struct zone_struct * zone, int age,
		struct list_head * list, int count)
{
	struct list_head *page_lru , *next;
	struct page * page;
	int over_rsslimit;

	count = count * kscand_work_percent / 100;
	/* Take the lock while messing with the list... */
	lru_lock(zone);
	while (count-- > 0 && !list_empty(list)) {
		page = list_entry(list->prev, struct page, lru);
		pte_chain_lock(page);
		if (page_referenced(page, &over_rsslimit)
				&& !over_rsslimit
				&& check_mapping_inuse(page))
			age_page_up_nolock(page, age);
		else {
			list_del(&page->lru);
			list_add(&page->lru, list);
		}
		pte_chain_unlock(page);
	}
	lru_unlock(zone);
	return 0;
}

My previous email shows examples of the number of pages in the list and
the scanning that happens.

david


Avi Kivity wrote:
> Andrea Arcangeli wrote:
>>
>> So I never found a relation to the symptom reported of VM kernel
>> threads going weird, with KVM optimal handling of kmap ptes.
>>   
> 
> 
> The problem is this code:
> 
> static int scan_active_list(struct zone_struct * zone, int age,
>                struct list_head * list)
> {
>        struct list_head *page_lru , *next;
>        struct page * page;
>        int over_rsslimit;
> 
>        /* Take the lock while messing with the list... */
>        lru_lock(zone);
>        list_for_each_safe(page_lru, next, list) {
>                page = list_entry(page_lru, struct page, lru);
>                pte_chain_lock(page);
>                if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>                        age_page_up_nolock(page, age);
>                pte_chain_unlock(page);
>        }
>        lru_unlock(zone);
>        return 0;
> }
> 
> If the pages in the list are in the same order as in the ptes (which is
> very likely), then we have the following access pattern
> 
> - set up kmap to point at pte
> - test_and_clear_bit(pte)
> - kunmap
> 
> From kvm's point of view this looks like
> 
> - several accesses to set up the kmap
>  - if these accesses trigger flooding, we will have to tear down the
> shadow for this page, only to set it up again soon
> - an access to the pte (emulated)
>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> emulations.  The pte is worthless anyway since the accessed bit is clear
> (so we can't set up a shadow pte for it)
>    - this bug was fixed
> - an access to tear down the kmap
> 
> [btw, am I reading this right? the entire list is scanned each time?
> 
> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> which would take at least a second no matter what we do.  VMware can
> probably special-case kmaps, but we can't]
> 


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:57                                                               ` Avi Kivity
  2008-05-28 15:39                                                                 ` David S. Ahern
@ 2008-05-28 15:58                                                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 15:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: David S. Ahern, kvm

On Wed, May 28, 2008 at 05:57:21PM +0300, Avi Kivity wrote:
> What about CONFIG_HIGHPTE?

Ah yes sorry! Official 2.4 has no highpte capability but surely RH
backported highpte to 2.4 so that would explain the cpu time spent in
kswapd _guest_ context.

If highpte is the problem and you're having trouble reproducing, I
recommend running a few dozen of those in the background on the 2.4 VM
that has the ZERO_PAGE support, immediately after boot. This will ensure
there are tons of pagetables in high memory. This should allocate purely
pagetables and allow for a worst case of highpte.  Check with
/proc/meminfo that the pagetable number goes up by a few megabytes for
each of those tasks. Then just try to allocate some real ram (not
zeropage) and if there's a problem with highptes it should be possible
to reproduce it with so many highptes allocated in the system. Guest
VM size should be 2G; you don't really need more than 2G to reproduce
using the ZERO_PAGE trick below.

#include <unistd.h>
#include <stdlib.h>
#include <string.h>

int main()
{
	char *p1, *p2;

	/* Two large anonymous mappings; the pages are never written,
	 * so with ZERO_PAGE support every page read maps to the shared
	 * zero page and no real ram is allocated. */
	p1 = malloc(512*1024*1024);
	p2 = malloc(512*1024*1024);

	/* memcmp touches every page read-only, which populates the
	 * pagetables (highptes) for 1G of virtual address space.  Both
	 * regions read as zeroes, so the compare must succeed; crash
	 * deliberately if it doesn't. */
	if (memcmp(p1, p2, 512*1024*1024))
		*(char *)0 = 0;

	/* Park the task so the pagetables stay allocated. */
	pause();

	return 0;
}


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 15:43                                                                 ` David S. Ahern
@ 2008-05-28 17:04                                                                   ` Andrea Arcangeli
  2008-05-28 17:24                                                                     ` David S. Ahern
  2008-05-29 10:01                                                                     ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 17:04 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Avi Kivity, kvm

On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
> This is the code in the RHEL3.8 kernel:
> 
> static int scan_active_list(struct zone_struct * zone, int age,
> 		struct list_head * list, int count)
> {
> 	struct list_head *page_lru , *next;
> 	struct page * page;
> 	int over_rsslimit;
> 
> 	count = count * kscand_work_percent / 100;
> 	/* Take the lock while messing with the list... */
> 	lru_lock(zone);
> 	while (count-- > 0 && !list_empty(list)) {
> 		page = list_entry(list->prev, struct page, lru);
> 		pte_chain_lock(page);
> 		if (page_referenced(page, &over_rsslimit)
> 				&& !over_rsslimit
> 				&& check_mapping_inuse(page))
> 			age_page_up_nolock(page, age);
> 		else {
> 			list_del(&page->lru);
> 			list_add(&page->lru, list);
> 		}
> 		pte_chain_unlock(page);
> 	}
> 	lru_unlock(zone);
> 	return 0;
> }
> 
> My previous email shows examples of the number of pages in the list and
> the scanning that happens.

This code looks better than the one below, as a limit was introduced
and the whole list isn't scanned anymore. If you decrease
kscand_work_percent (I assume it's a sysctl even if it's missing the
sysctl_ prefix) to, say, 1, you can limit the damage. Did you try it?

> Avi Kivity wrote:
> > Andrea Arcangeli wrote:
> >>
> >> So I never found a relation to the symptom reported of VM kernel
> >> threads going weird, with KVM optimal handling of kmap ptes.
> >>   
> > 
> > 
> > The problem is this code:
> > 
> > static int scan_active_list(struct zone_struct * zone, int age,
> >                struct list_head * list)
> > {
> >        struct list_head *page_lru , *next;
> >        struct page * page;
> >        int over_rsslimit;
> > 
> >        /* Take the lock while messing with the list... */
> >        lru_lock(zone);
> >        list_for_each_safe(page_lru, next, list) {
> >                page = list_entry(page_lru, struct page, lru);
> >                pte_chain_lock(page);
> >                if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
> >                        age_page_up_nolock(page, age);
> >                pte_chain_unlock(page);
> >        }
> >        lru_unlock(zone);
> >        return 0;
> > }
>
> > If the pages in the list are in the same order as in the ptes (which is
> > very likely), then we have the following access pattern

Yes it is likely.

> > - set up kmap to point at pte
> > - test_and_clear_bit(pte)
> > - kunmap
> > 
> > From kvm's point of view this looks like
> > 
> > - several accesses to set up the kmap

Hmm, the kmap establishment takes a single guest operation in the
fixmap area. That's a single write to the pte, writing a pte_t-sized
region of 8/4 bytes (PAE/non-PAE). The same pte_t is then cleared and
flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
I count 1 write here so far.

> >  - if these accesses trigger flooding, we will have to tear down the
> > shadow for this page, only to set it up again soon

So the shadow mapping the fixmap area would be torn down by the
flooding.

Or is it the shadow corresponding to the real user pte pointed to by
the fixmap that is unshadowed by the flooding, or both/all?

> > - an access to the pte (emulated)

Here I count the second write. This isn't done on the fixmap area
like the first write above; it is a write to the real user pte,
pointed to by the fixmap. So if this is emulated it means the shadow of
the user pte pointing to the real data page is still active.

> >  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> > emulations.  The pte is worthless anyway since the accessed bit is clear
> > (so we can't set up a shadow pte for it)
> >    - this bug was fixed

You mean the accessed bit on fixmap pte used by kmap? Or the user pte
pointed by the fixmap pte?

> > - an access to tear down the kmap

Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
matters).

> > [btw, am I reading this right? the entire list is scanned each time?

If the list parameter isn't a local LIST_HEAD on the stack but the
global one, it's a full scan each time. I guess it's the global list,
looking at the new code at the top that has the kscand_work_percent
throttle.

> > if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> > which would take at least a second no matter what we do.  VMware can
> > probably special-case kmaps, but we can't]

Perhaps they have a per-age-bucket list or similar, but still I doubt
this works well on the host either... I guess the virtualization overhead
is exacerbating the inefficiency. Perhaps killall -STOP kscand is a good
enough fix ;). This seems to only push the age up; to be functional the
age has to go down, and I guess the aging-down is done by other threads,
so stopping kscand may not hurt.

I think what we should aim for is to quickly reach this condition:

1) always keep the fixmap/kmap pte_t shadowed and emulate the
   kmap/kunmap access so the test_and_clear_young done on the user pte
   doesn't require to re-establish the spte representing the fixmap
   virtual address. If we don't emulate fixmap we'll have to
   re-establish the spte during the write to the user pte, and
   tear it down again during kunmap_atomic. So there's not much doubt
   fixmap access emulation is worth it.

2) get rid of the user pte shadow mapping pointing to the user data so
   the test_and_clear of the young bitflag on the user pte will not be
   emulated and it'll run at full CPU speed through the shadow pte
   mapping corresponding to the fixmap virtual address

The kscand pattern is the same as running mprotect on a 32bit 2.6
kernel, so it sounds worth optimizing for, even if kscand may be
unfixable without killall -STOP kscand or VM fixes to the guest.

However I'm not sure about point 2 in light of mprotect. With
mprotect the guest virtual addresses mapped by the guest user ptes
will be used. It's not like kscand, which may write forever to the user
ptes without ever using the guest virtual addresses that they're
mapping. So we'd better be sure that by unshadowing and optimizing
kscand we're not hurting mprotect or other pte mangling operations in
2.6 that will likely keep accessing the guest virtual addresses mapped
by the user ptes previously modified.

Hope this makes sense; I'm not sure I understand this completely.


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 17:04                                                                   ` Andrea Arcangeli
@ 2008-05-28 17:24                                                                     ` David S. Ahern
  2008-05-29 10:01                                                                     ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 17:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Avi Kivity, kvm

Yes, I've tried changing kscand_work_percent (values of 50 and 30).
Basically it makes kscand wake more often (i.e., MIN_AGING_INTERVAL
declines in proportion) but do less work each trip through the lists.

I have not seen a noticeable change in guest behavior.

david


Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
>> This is the code in the RHEL3.8 kernel:
>>
>> static int scan_active_list(struct zone_struct * zone, int age,
>> 		struct list_head * list, int count)
>> {
>> 	struct list_head *page_lru , *next;
>> 	struct page * page;
>> 	int over_rsslimit;
>>
>> 	count = count * kscand_work_percent / 100;
>> 	/* Take the lock while messing with the list... */
>> 	lru_lock(zone);
>> 	while (count-- > 0 && !list_empty(list)) {
>> 		page = list_entry(list->prev, struct page, lru);
>> 		pte_chain_lock(page);
>> 		if (page_referenced(page, &over_rsslimit)
>> 				&& !over_rsslimit
>> 				&& check_mapping_inuse(page))
>> 			age_page_up_nolock(page, age);
>> 		else {
>> 			list_del(&page->lru);
>> 			list_add(&page->lru, list);
>> 		}
>> 		pte_chain_unlock(page);
>> 	}
>> 	lru_unlock(zone);
>> 	return 0;
>> }
>>
>> My previous email shows examples of the number of pages in the list and
>> the scanning that happens.
> 
> This code looks better than the one below, as a limit was introduced
> and the whole list isn't scanned anymore, if you decrease
> kscand_work_percent (I assume it's a sysctl even if it's missing the
> sysctl_ prefix) to say 1, you can limit damages. Did you try it?
> 
>> Avi Kivity wrote:
>>> Andrea Arcangeli wrote:
>>>> So I never found a relation to the symptom reported of VM kernel
>>>> threads going weird, with KVM optimal handling of kmap ptes.
>>>>   
>>>
>>> The problem is this code:
>>>
>>> static int scan_active_list(struct zone_struct * zone, int age,
>>>                struct list_head * list)
>>> {
>>>        struct list_head *page_lru , *next;
>>>        struct page * page;
>>>        int over_rsslimit;
>>>
>>>        /* Take the lock while messing with the list... */
>>>        lru_lock(zone);
>>>        list_for_each_safe(page_lru, next, list) {
>>>                page = list_entry(page_lru, struct page, lru);
>>>                pte_chain_lock(page);
>>>                if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>>>                        age_page_up_nolock(page, age);
>>>                pte_chain_unlock(page);
>>>        }
>>>        lru_unlock(zone);
>>>        return 0;
>>> }
>>> If the pages in the list are in the same order as in the ptes (which is
>>> very likely), then we have the following access pattern
> 
> Yes it is likely.
> 
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
> 
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, to write a pte_t 8/4
> byte large region (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
> 
> I count 1 write here so far.
> 
>>>  - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
> 
> So the shadow mapping the fixmap area would be tear down by the
> flooding.
> 
> Or is the shadow corresponding to the real user pte pointed by the
> fixmap, that is unshadowed by the flooding, or both/all?
> 
>>> - an access to the pte (emulated)
> 
> Here I count the second write and this isn't done on the fixmap area
> like the first write above, but this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
> 
>>>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations.  The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)
>>>    - this bug was fixed
> 
> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
> pointed by the fixmap pte?
> 
>>> - an access to tear down the kmap
> 
> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
> matters).
> 
>>> [btw, am I reading this right? the entire list is scanned each time?
> 
> If the list parameter isn't a local LIST_HEAD on the stack but the
> global one it's a full scan each time. I guess it's the global list
> looking at the new code at the top that has a kswapd_scan_limit
> sysctl.
> 
>>> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
>>> which would take at least a second no matter what we do.  VMware can
>>> probably special-case kmaps, but we can't]
> 
> Perhaps they've a list per-age bucket or similar but still I doubt
> this works well on host either... I guess the virtualization overhead
> is exacerbating the inefficiency. Perhaps killall -STOP kscand is good
> enough fix ;). This seem to only push the age up, to be functional the
> age has to go down and I guess the go-down is done by other threads so
> stopping kscand may not hurt.
> 
> I think what we should aim for is to quickly reach this condition:
> 
> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
>    kmap/kunmap access so the test_and_clear_young done on the user pte
>    doesn't require to re-establish the spte representing the fixmap
>    virtual address. If we don't emulate fixmap we'll have to
>    re-establish the spte during the write to the user pte, and
>    tear it down again during kunmap_atomic. So there's not much doubt
>    fixmap access emulation is worth it.
> 
> 2) get rid of the user pte shadow mapping pointing to the user data so
>    the test_and_clear of the young bitflag on the user pte will not be
>    emulated and it'll run at full CPU speed through the shadow pte
>    mapping corresponding to the fixmap virtual address
> 
> kscand pattern is the same as running mprotect on a 32bit 2.6
> kernel so it sounds worth optimizing for it, even if kscand may be
> unfixable without killall -STOP kscand or VM fixes to guest.
> 
> However I'm not sure about point 2 at the light of mprotect. With
> mprotect the guest virtual addresses mapped by the guest user ptes
> will be used. It's not like kscand that may write forever to the user
> ptes without ever using the guest virtual addresses that they're
> mapping. So we better be sure that by unshadowing and optimizing
> kscand we're not hurting mprotect or other pte mangling operations in
> 2.6 that will likely keep accessing the guest virtual addresses mapped
> by the user ptes previously modified.
> 
> Hope this makes any sense, I'm not sure to understand this completely.


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 14:35                                                             ` Avi Kivity
@ 2008-05-28 19:49                                                               ` David S. Ahern
  2008-05-29  6:37                                                                 ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 19:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm


I have a clone of the kvm repository, but evidently I am not running the
right magic to see the changes in the per-page-pte-tracking branch.  I
ran the following:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
git branch per-page-pte-tracking

[dsa@daahern-lx kvm]$ git branch
  master
* per-page-pte-tracking

But arch/x86/kvm/mmu.c does not show the changes for the
per-page-pte-history.patch.

What am I not doing correctly here?

david



Avi Kivity wrote:
> David S. Ahern wrote:
>> Weird. Could it be something about the hosts?
>>
>> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
>> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
>>
>> I'll rebuild kvm-69 with your latest patch and try the test programs
>> again.
>>   
> 
> I've pushed it into kvm.git, branch name per-page-pte-tracking.
> 


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 19:49                                                               ` David S. Ahern
@ 2008-05-29  6:37                                                                 ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29  6:37 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
> I have a clone of the kvm repository, but evidently not running the
> right magic to see the changes in the per-page-pte-tracking branch.  I
> ran the following:
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
> git branch per-page-pte-tracking
>
> [dsa@daahern-lx kvm]$ git branch
>   master
> * per-page-pte-tracking
>
> But arch/x86/kvm/mmu.c does not show the changes for the
> per-page-pte-history.patch.
>
> What am I not doing correctly here?
>
>   

'git branch' creates a new branch.  Try the following

  git fetch origin
  git checkout origin/per-page-pte-tracking

If that doesn't work (old git) try

  git fetch git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git 
per-page-pte-tracking:refs/heads/per-page-pte-tracking
  git checkout per-page-pte-tracking

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 17:04                                                                   ` Andrea Arcangeli
  2008-05-28 17:24                                                                     ` David S. Ahern
@ 2008-05-29 10:01                                                                     ` Avi Kivity
  2008-05-29 14:27                                                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 10:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
>
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
>>>       
>
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, to write a pte_t 8/4
> byte large region (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
>
> I count 1 write here so far.
>
>   

No, two:

static inline void set_pte(pte_t *ptep, pte_t pte)
{
        ptep->pte_high = pte.pte_high;
        smp_wmb();
        ptep->pte_low = pte.pte_low;
}


>>>  - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
>>>       
>
> So the shadow mapping the fixmap area would be tear down by the
> flooding.
>   

Before we started patching this, yes.

> Or is the shadow corresponding to the real user pte pointed by the
> fixmap, that is unshadowed by the flooding, or both/all?
>
>   

After we started patching this, no, but with per-page-pte-history, yes 
(correctly).

>>> - an access to the pte (emulted)
>>>       
>
> Here I count the second write and this isn't done on the fixmap area
> like the first write above, but this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
>   

Right.  But if we are scanning a page table linearly, it should be 
unshadowed.

>   
>>>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations.  The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)
>>>    - this bug was fixed
>>>       
>
> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
> pointed by the fixmap pte?
>   

The user pte.  After guest code runs test_and_clear_bit(accessed_bit, 
ptep), we can't shadow that pte (all shadowed ptes must have the 
accessed bit set in the corresponding guest pte, similar to how a tlb 
entry can only exist if the accessed bit is set).

>   
>>> - an access to tear down the kmap
>>>       
>
> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
> matters).
>   

Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.

> I think what we should aim for is to quickly reach this condition:
>
> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
>    kmap/kunmap access so the test_and_clear_young done on the user pte
>    doesn't require to re-establish the spte representing the fixmap
>    virtual address. If we don't emulate fixmap we'll have to
>    re-establish the spte during the write to the user pte, and
>    tear it down again during kunmap_atomic. So there's not much doubt
>    fixmap access emulation is worth it.
>   

That is what is done by current HEAD.  
418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible.

Note that there is an alternative: allow the kmap pte to be unshadowed, 
and instead emulate the access through that pte (i.e. emulate the btc 
instruction).  I don't think it's worth it though because it hurts other 
users of the fixmap page.
> 2) get rid of the user pte shadow mapping pointing to the user data so
>    the test_and_clear of the young bitflag on the user pte will not be
>    emulated and it'll run at full CPU speed through the shadow pte
>    mapping corresponding to the fixmap virtual address
>   

That's what per-page-pte-history is supposed to do.  The first few 
accesses are emulated; the rest will be native.

It's still not full speed as the kmap setup has to be emulated (twice).

One possible optimization is that if we see the first part of the kmap 
instantiation, we emulate a few more instructions before returning to 
the guest.  Xen does this IIRC.


> kscand pattern is the same as running mprotect on a 32bit 2.6
> kernel so it sounds worth optimizing for it, even if kscand may be
> unfixable without killall -STOP kscand or VM fixes to guest.
>
>   

I'm no longer sure the access pattern is sequential, since I see 
kmap_atomic() will not recreate the pte if its value has not changed 
(unless HIGHMEM_DEBUG).

-- 
error compiling committee.c: too many arguments to function



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 15:39                                                                 ` David S. Ahern
@ 2008-05-29 11:49                                                                   ` Avi Kivity
  2008-05-29 12:10                                                                   ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 11:49 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Andrea Arcangeli, kvm

David S. Ahern wrote:
> I've been instrumenting the guest kernel as well. It's the scanning of
> the active lists that triggers a lot of calls to paging64_prefetch_page,
> and, as you guys know, correlates with the number of direct pages in the
> list. Earlier in this thread I traced the kvm cycles to
> paging64_prefetch_page(). See
>   

I optimized this function a bit, hopefully it will relieve some of the 
pain.  We still need to reduce the number of times it is called.


-- 
error compiling committee.c: too many arguments to function



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 15:39                                                                 ` David S. Ahern
  2008-05-29 11:49                                                                   ` Avi Kivity
@ 2008-05-29 12:10                                                                   ` Avi Kivity
  2008-05-29 13:49                                                                     ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 12:10 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Andrea Arcangeli, kvm

David S. Ahern wrote:
> I've been instrumenting the guest kernel as well. It's the scanning of
> the active lists that triggers a lot of calls to paging64_prefetch_page,
> and, as you guys know, correlates with the number of direct pages in the
> list. Earlier in this thread I traced the kvm cycles to
> paging64_prefetch_page(). See
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html
>
> In the guest I started capturing scans (kscand() loop) that took longer
> than a jiffie. Here's an example for 1 trip through the active lists,
> both anonymous and cache:
>
> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
> 36234, dj 225
>
>   

HZ=512, so half a second.

41K pages in 0.5s -> 80K pages/sec.  Considering we have _at_least_ two 
emulations per page, this is almost reasonable.

> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3
>
> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
> 84829, dj 848
>   

Here we scanned 100K pages in ~2 seconds.  50K pages/sec, not too good.

> I'll pull down the git branch and give it a spin.
>   

I've rebased it again to include the prefetch_page optimization.

-- 
error compiling committee.c: too many arguments to function



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 12:10                                                                   ` Avi Kivity
@ 2008-05-29 13:49                                                                     ` David S. Ahern
  2008-05-29 14:08                                                                       ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 13:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, kvm

This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's
just the one age bucket and this is just one example pulled randomly
(well after boot). During that time kscand does get scheduled out, but
ultimately guest time is at 100% during the scans.
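Spelling out that arithmetic as a sketch (it assumes dj is the whole scan duration in jiffies; the numbers below come from the trace quoted earlier in the thread):

```c
/* Convert a kscand trace entry (page count and dj jiffies) into a scan
 * rate, given the guest's HZ.  With HZ=100, dj=848 is 8.48 seconds, and
 * 104078 pages over that window is roughly 12K pages/sec. */
static double scan_seconds(unsigned long dj, unsigned long hz)
{
	return (double)dj / (double)hz;
}

static double pages_per_second(unsigned long pages, unsigned long dj,
			       unsigned long hz)
{
	return (double)pages / scan_seconds(dj, hz);
}
```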

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I've been instrumenting the guest kernel as well. It's the scanning of
>> the active lists that triggers a lot of calls to paging64_prefetch_page,
>> and, as you guys know, correlates with the number of direct pages in the
>> list. Earlier in this thread I traced the kvm cycles to
>> paging64_prefetch_page(). See
>>
>> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html
>>
>> In the guest I started capturing scans (kscand() loop) that took longer
>> than a jiffie. Here's an example for 1 trip through the active lists,
>> both anonymous and cache:
>>
>> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
>> 36234, dj 225
>>
>>   
> 
> HZ=512, so half a second.
> 
> 41K pages in 0.5s -> 80K pages/sec.  Considering we have _at_least_ two
> emulations per page, this is almost reasonable.
> 
>> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct
>> 1249, dj 3
>>
>> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
>> 84829, dj 848
>>   
> 
> Here we scanned 100K pages in ~2 seconds.  50K pages/sec, not too good.
> 
>> I'll pull down the git branch and give it a spin.
>>   
> 
> I've rebased it again to include the prefetch_page optimization.
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 13:49                                                                     ` David S. Ahern
@ 2008-05-29 14:08                                                                       ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 14:08 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Andrea Arcangeli, kvm

David S. Ahern wrote:
> This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's
> just the one age bucket and this is just one example pulled randomly
> (well after boot). During that time kscand does get scheduled out, but
> ultimately guest time is at 100% during the scans.
>
>   

Er, yes.  Don't know where that CONFIG_HZ=512 came from in the centos 
config files:

That's pretty bad, then.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 10:01                                                                     ` Avi Kivity
@ 2008-05-29 14:27                                                                       ` Andrea Arcangeli
  2008-05-29 15:11                                                                         ` David S. Ahern
  2008-05-29 15:16                                                                         ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-29 14:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: David S. Ahern, kvm

On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
> No, two:
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
>        ptep->pte_high = pte.pte_high;
>        smp_wmb();
>        ptep->pte_low = pte.pte_low;
> }

Right, that can be 2 or 1 depending on PAE vs. non-PAE; other 2.4
enterprise distros with pte-highmem ship non-PAE kernels by default.
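To make the count concrete, here is a toy sketch (not kernel code) of why a PAE pte install is two 32-bit stores while a non-PAE one is a single store — each store being a separate candidate for an emulated write while the page table is shadowed:

```c
#include <stdint.h>

typedef struct { uint32_t pte_high, pte_low; } pae_pte_t;

/* Count the 32-bit stores needed to install a PAE pte: high word first,
 * then low word (an smp_wmb() sits between them in the real kernel). */
static int pae_set_pte_traps(uint64_t pte)
{
	pae_pte_t p;
	int traps = 0;

	p.pte_high = (uint32_t)(pte >> 32); traps++;
	p.pte_low  = (uint32_t)pte;         traps++;
	(void)p;
	return traps;
}

/* Non-PAE: the pte fits in one word, so one store and one potential trap. */
static int nonpae_set_pte_traps(uint32_t pte)
{
	uint32_t p = pte;

	(void)p;
	return 1;
}
```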

>>>>  - if these accesses trigger flooding, we will have to tear down the
>>>> shadow for this page, only to set it up again soon
>>>>       
>>
>> So the shadow mapping the fixmap area would be tear down by the
>> flooding.
>>   
>
> Before we started patching this, yes.

Ok so now the one/two writes to the guest fixmap virt address are
emulated and the spte isn't torn down.

>
>> Or is the shadow corresponding to the real user pte pointed by the
>> fixmap, that is unshadowed by the flooding, or both/all?
>>
>>   
>
> After we started patching this, no, but with per-page-pte-history, yes 
> (correctly).

So with the per-page-pte-history the shadow representing the guest
user pte that is being modified by page_referenced is unshadowed.

>>>> - an access to the pte (emulted)
>>>>       
>>
>> Here I count the second write and this isn't done on the fixmap area
>> like the first write above, but this is a write to the real user pte,
>> pointed by the fixmap. So if this is emulated it means the shadow of
>> the user pte pointing to the real data page is still active.
>>   
>
> Right.  But if we are scanning a page table linearly, it should be 
> unshadowed.

I think we're often not scanning page tables linearly with pte_chains,
and yet those ptes should still be unshadowed. mmaps won't always bring
memory in in linear order, and memory isn't always initialized by memset
or paged in with contiguous virtual accesses.

So while the assumption that following the active list will sometimes
return guest ptes that map contiguous guest virtual addresses is valid,
it only accounts for a small percentage of the active list. It largely
depends on the userland apps. Furthermore, even if the active lru is
initially pointing to linear ptes, the list is then split into age
buckets depending on the access patterns at runtime, and that further
fragments the linearity of the virtual addresses of the kmapped ptes.

BTW, one thing we didn't account for in the previous email is that there
can be more than one guest user pte modified by page_referenced, if
it's not a direct page. And non-direct pages surely won't provide
linear scans; in fact, for non-direct pages the most common case is that
the pte_t will point to the same virtual address but through a different
pgd_t * (and in turn a different pmd_t).

>>>>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>>> emulations.  The pte is worthless anyway since the accessed bit is clear
>>>> (so we can't set up a shadow pte for it)
>>>>    - this bug was fixed
>>>>       
>>
>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
>> pointed by the fixmap pte?
>>   
>
> The user pte.  After guest code runs test_and_clear_bit(accessed_bit, 
> ptep), we can't shadow that pte (all shadowed ptes must have the accessed 
> bit set in the corresponding guest pte, similar to how a tlb entry can only 
> exist if the accessed bit is set).

Is this a software invariant to ensure that we'll refresh the accessed
bit on the user pte too?

I assume this is needed because otherwise, if we clear the accessed bit
on the shadow pte and we clear it on the user pte, then when the shadow
is mapped in the TLB again the accessed bit will be set on the shadow
by hardware, but not on the user pte, because the accessed bit gets set
on the spte without a kvm page fault.

So this means kscand, by clearing the accessed bitflag on them, should
automatically unshadow all user ptes pointed to by the fixmap pte.

So a second test_and_clear_bit on the same user pte will run through
the fixmap pte established by kmap_atomic without traps.

So this means that when the user program runs again, it'll find the user
pte unshadowed and it'll have to re-instantiate the shadow ptes with a
kvm page fault, whose primary objective is marking the user pte
accessed again (to notify the next kscand pass that the data page
pointed to by the user pte was used in the meantime).

If I understand correctly, the establishment of the shadow pte
corresponding to the user pte will have to write-protect the spte
corresponding to the fixmap pte, because we need to intercept
modifications to shadowed guest ptes, and the spte corresponding to the
fixmap guest pte is now pointing to a shadowed guest pte after the
program returns to running.

Then when kscand runs again, for the pages that have been faulted in
by the user program, we'll trap the test_and_clear_bit happening
through the readonly spte corresponding to the fixmap guest pte, we'll
unshadow the spte of the guest user pte again, and we'll mark the
spte corresponding to the fixmap pte as read-write again, because the
test_and_clear_bit tells us that we have to unshadow instead of
emulating.
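The cycle just described can be sketched as a toy state machine (all names invented for illustration; this is not kvm code):

```c
/* Toy model of the unshadow/reshadow cycle: kscand clears the accessed
 * bit through the fixmap, the user program later refaults the page. */
struct pte_state {
	int shadowed;            /* does a shadow pte exist for the user pte? */
	int fixmap_spte_writable;/* is the spte for the fixmap slot writable? */
	int guest_accessed;      /* accessed bit in the guest user pte */
};

/* kscand's test_and_clear_bit on the user pte, via the fixmap mapping. */
static void kscand_clear_accessed(struct pte_state *s)
{
	if (s->shadowed) {
		/* write-protect fault on the fixmap spte: unshadow the user
		 * pte and make the fixmap spte writable again */
		s->shadowed = 0;
		s->fixmap_spte_writable = 1;
	}
	s->guest_accessed = 0;	/* the clear itself then runs natively */
}

/* The user program touches the page again: a kvm page fault reinstates
 * the shadow pte, sets the guest accessed bit, and write-protects the
 * fixmap spte so later pte writes are intercepted. */
static void user_refault(struct pte_state *s)
{
	s->shadowed = 1;
	s->guest_accessed = 1;
	s->fixmap_spte_writable = 0;
}

/* Drive one full cycle and check each intermediate state. */
static int cycle_ok(void)
{
	struct pte_state s = { 1, 0, 1 };	/* shadowed, fixmap wrprotected */

	kscand_clear_accessed(&s);		/* first pass: trap + unshadow */
	if (s.shadowed || !s.fixmap_spte_writable || s.guest_accessed)
		return 0;
	kscand_clear_accessed(&s);		/* second pass: no trap at all */
	if (s.shadowed || !s.fixmap_spte_writable)
		return 0;
	user_refault(&s);			/* program runs again */
	return s.shadowed && s.guest_accessed && !s.fixmap_spte_writable;
}
```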

>>>> - an access to tear down the kmap
>>>>       
>>
>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>> matters).
>>   
>
> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.

2.4, yes. 2.6 will do something similar to CONFIG_HIGHMEM_DEBUG.

2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and
does nothing in kunmap_atomic.

2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic.

>> I think what we should aim for is to quickly reach this condition:
>>
>> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
>>    kmap/kunmap access so the test_and_clear_young done on the user pte
>>    doesn't require to re-establish the spte representing the fixmap
>>    virtual address. If we don't emulate fixmap we'll have to
>>    re-establish the spte during the write to the user pte, and
>>    tear it down again during kunmap_atomic. So there's not much doubt
>>    fixmap access emulation is worth it.
>>   
>
> That is what is done by current HEAD.  
> 418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible.

Cool!

>
> Note that there is an alternative: allow the kmap pte to be unshadowed, and 
> instead emulate the access through that pte (i.e. emulate the btc 
> instruction).  I don't think it's worth it though because it hurts other 
> users of the fixmap page.
>> 2) get rid of the user pte shadow mapping pointing to the user data so
>>    the test_and_clear of the young bitflag on the user pte will not be
>>    emulated and it'll run at full CPU speed through the shadow pte
>>    mapping corresponding to the fixmap virtual address
>>   
>
> That's what per-page-pte-history is supposed to do.  The first few accesses 
> are emulated, the next will be native.

Why not go native immediately when we notice a test_and_clear of
the accessed bit? First, the ptes won't be in contiguous virtual
address order, so if the flooding of the sptes corresponding to the
guest user ptes depends on the gpa of the guest user ptes being
contiguous, it won't work well. But more importantly, we've found a
test_and_clear_bit of the accessed bitflag, so we should unshadow the
user pte that is being marked "old" immediately, without needing to
detect any flooding.

> It's still not full speed as the kmap setup has to be emulated (twice).

Agreed, the 1/2/3 emulations on writes to the fixmap area during
kmap_atomic (1/2 for non-PAE/PAE, plus 1 further pte_clear on 2.6 or 2.4
debug-highmem) seem unavoidable.

But the test_and_clear_bit write-protect fault (when the guest user pte
is shadowed) should just unshadow the guest user pte, mark the spte
representing the fixmap pte as writeable, and return immediately to
guest mode to actually run test_and_clear_bit natively instead of
writing it through emulation.

Noticing the test_and_clear_bit also requires a bit of instruction
"detection", but once we detected it from the eip address, we don't
have to write anything to the guest.

But I guess I'm missing something...

> One possible optimization is that if we see the first part of the kmap 
> instantiation, we emulate a few more instructions before returning to the 
> guest.  Xen does this IIRC.

Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
sure 32-bit PAE is important enough to justify this. Most 32-bit
enterprise kernels I've worked with aren't compiled with PAE; only one,
called bigsmp, is.

Also on 2.6, we could get the same benefit by making 2.6 at least as
optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
only after setting it to a new value. Xen can't optimize that write in
kunmap_atomic.

2.6 has debug enabled by default for no good reason. So that would be
the first optimization to do as it saves a few cycles per
kunmap_atomic on host too.

> I'm no longer sure the access pattern is sequential, since I see 
> kmap_atomic() will not recreate the pte if its value has not changed 
> (unless HIGHMEM_DEBUG).

Hmm kmap_atomic always writes a new value to the fixmap pte, even if
it was mapping the same user pte as before.

static inline void *kmap_atomic(struct page *page, enum km_type type)
{
	enum fixed_addresses idx;
	unsigned long vaddr;

	if (page < highmem_start_page)
		return page_address(page);

	idx = type + KM_TYPE_NR*smp_processor_id();
	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
	if (!pte_none(*(kmap_pte-idx)))
		out_of_line_bug();
#endif
	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
	__flush_tlb_one(vaddr);

	return (void*) vaddr;
}

2.6 does too, because it does the debug pte_clear in kunmap_atomic.

In theory even the host could do pte_same() and avoid an invlpg if the
pte didn't change, but I'm unsure how frequently we remap the same page:
pte loops like mprotect map the 4k pte page once and then loop over it
through the fixmap virtual address. So frequent remapping of the same
page with kmap_atomic sounds unlikely.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 14:27                                                                       ` Andrea Arcangeli
@ 2008-05-29 15:11                                                                         ` David S. Ahern
  2008-05-29 15:16                                                                         ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 15:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Avi Kivity, kvm


Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
>> No, two:
>>
>> static inline void set_pte(pte_t *ptep, pte_t pte)
>> {
>>        ptep->pte_high = pte.pte_high;
>>        smp_wmb();
>>        ptep->pte_low = pte.pte_low;
>> }
> 
> Right, that can be 2 or 1 depending on PAE non-PAE, other 2.4
> enterprise distro with pte-highmem ships non-PAE kernels by default.

RHEL3U8 has CONFIG_X86_PAE set.

<snipped>

>>>>> - an access to tear down the kmap
>>>>>       
>>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>>> matters).
>>>   
>> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.
> 
> 2.4 yes. 2.6 is will do similar to CONFIG_HIGHMEM_DEBUG.
> 
> 2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and
> does nothing in kunmap_atomic.
> 
> 2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic.

CONFIG_DEBUG_HIGHMEM is set.

<snipped>

>> One possible optimization is that if we see the first part of the kmap 
>> instantiation, we emulate a few more instructions before returning to the 
>> guest.  Xen does this IIRC.
> 
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is that important to do this. Most 32bit enterprise
> kernels I worked aren't compiled with PAE, only one called bigsmp is.

RHEL3 has a hugemem kernel which basically just enables the 4G/4G split.
My guest with the hugemem kernel runs much better than the standard smp
kernel.


If you care to download it the RHEL3U8 kernel source is posted here:
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/os/SRPMS/kernel-2.4.21-47.EL.src.rpm

Red Hat does heavily patch kernels, so they will be dramatically
different from the kernel.org kernel with the same version number.

david

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 14:27                                                                       ` Andrea Arcangeli
  2008-05-29 15:11                                                                         ` David S. Ahern
@ 2008-05-29 15:16                                                                         ` Avi Kivity
  2008-05-30 13:12                                                                           ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 15:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
>>> Here I count the second write and this isn't done on the fixmap area
>>> like the first write above, but this is a write to the real user pte,
>>> pointed by the fixmap. So if this is emulated it means the shadow of
>>> the user pte pointing to the real data page is still active.
>>>   
>>>       
>> Right.  But if we are scanning a page table linearly, it should be 
>> unshadowed.
>>     
>
> I think we're often not scanning page table linearly with pte_chains,
> but yet those should be still unshadowed. mmaps won't always bring
> memory in linear order, memory isn't always initialized or by memset
> or pagedin with contiguous virtual accesses.
>
>   

I guess we aren't scanning the page table linearly, since with the 
linear-scan test case I can't reproduce the problem.

> So while the assumption that following the active list will sometime
> return guest ptes that maps contiguous guest virtual address is valid,
> it only accounts for a small percentage of the active list. It largely
> depends on the userland apps. Furthermore even if the active lru is
> initially pointing to linear ptes, the list is then split into age
> buckets depending on the access patterns at runtime, so that further
> fragments the linearity of the virtual addresses of the kmapped ptes.
>
> BTW, one thing we didn't account for in previous email, is that there
> can be more than one guest user pte modified by page_referenced, if
> it's not a direct page. And non direct pages surely won't provide
> linear scans, infact for non linear pages the most common is that the
> pte_t will point to the same virtual address but on a different
> pgd_t * (and in turn on a different pmd_t).
>
>   

Since the pte tracking is per-page, it won't be affected by shared pages.

>>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
>>> pointed by the fixmap pte?
>>>   
>>>       
>> The user pte.  After guest code runs test_and_clear_bit(accessed_bit, 
>> ptep), we can't shadow that pte (all shadowed ptes must have the accessed 
>> bit set in the corresponding guest pte, similar to how a tlb entry can only 
>> exist if the accessed bit is set).
>>     
>
> Is this software invariant to ensure that we'll refresh the accessed
> bit on the user pte too?
>
>   

Yes.  We need a fault in order to set the guest accessed bit.
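A minimal sketch of that invariant (illustrative only; 0x20 is the x86 accessed bit):

```c
#define PTE_ACCESSED 0x20u	/* x86 accessed bit (bit 5) of a pte */

/* A shadow pte may exist only if the guest pte's accessed bit is set,
 * so instantiating the shadow on a kvm page fault also sets the guest
 * bit -- analogous to how a hardware TLB entry implies the A bit. */
static unsigned int shadow_fault(unsigned int gpte)
{
	return gpte | PTE_ACCESSED;
}

/* After the guest's test_and_clear_bit clears the A bit, this check
 * fails and the spte must be dropped until the next fault. */
static int may_shadow(unsigned int gpte)
{
	return (gpte & PTE_ACCESSED) != 0;
}
```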

> So this means kscand by clearing the accessed bitflag on them, should
> automatically unshadowing all user ptes pointed by the fixmap pte.
>
> So a secnd test_and_clear_bit on the same user pte will run through
> the fixmap pte established by kmap_atomic without traps.
>
> So this means when the user program run again, it'll find the user pte
> unshadowed and it'll have to re-instantiate the shadow ptes with a kvm
> page fault, that has the primary objective of marking the user pte
> accessed again (to notify the next kscand pass that the data page
> pointed by the user pte was used meanwhile).
>
> If I understand correctly, the establishment of the shadow pte
> corresponding to the user pte, will have to mark wrprotect the spte
> corresponding to the fixmap pte because we need to intercept
> modifications to shadowed guest ptes and the spte corresponding to the
> fixmap guest pte is now pointing to a shadowed guest pte after the
> program returns running.
>
> Then when kscand runs again, for the pages that have been faulted in
> by the user program, we'll trap the test_and_clear_bit happening
> through the readonly spte corresponding to the fixmap guest pte, and
> we'll unshadow the spte of the guest user pte again and we'll mark the
> spte corresponding to the fixmap pte as read-write again, because of
> the test_and_clear_bit tells us that we've to unshadow instead of
> emulating.
>   

Yes.

>>> 2) get rid of the user pte shadow mapping pointing to the user data so
>>>    the test_and_clear of the young bitflag on the user pte will not be
>>>    emulated and it'll run at full CPU speed through the shadow pte
>>>    mapping corresponding to the fixmap virtual address
>>>   
>>>       
>> That's what per-page-pte-history is supposed to do.  The first few accesses 
>> are emulated, the next will be native.
>>     
>
> Why not to go native immediately when we notice a test_and_clear of
> the accessed bit? First the ptes won't be in contiguous virtual
> address order, so if the flooding of the sptes corresponding to the
> guest user pte depends on the gpa of the guest user ptes being
> contiguous it won't work well. But more importantly we've found a
> test_and_clear_bit of the accessed bitflag, so we should unshadow the
> user pte that is being marked "old" immediately without need to detect
> any flooding.
>   

Unshadowing a page is expensive, both in immediate cost, and in future 
cost of reshadowing the page and taking faults.  It's worthwhile to be 
sure the guest really doesn't want it as a page table.


>> It's still not full speed as the kmap setup has to be emulated (twice).
>>     
>
> Agreed, the 1/2/3 emulations on writes to the fixmap area during
> kmap_atomic (1/2 for non-PAE/PAE and 1 further pte_clear on 2.6 or 2.4
> debug-highmem) seems unavoidable.
>
> But the test_and_clear_bit writprotect fault (when the guest user pte
> is shadowed) should just unshadow the guest user pte, mark the spte
> representing the fixmap pte as writeable, and return immediately to
> guest mode to actually run test_and_clear_bit natively without writing
> it through emulation.
>
> Noticing the test_and_clear_bit also requires a bit of instruction
> "detection", but once we detected it from the eip address, we don't
> have to write anything to the guest.
>
> But I guess I'm missing something...
>
>   

If the pages are not scanned linearly, then unshadowing may not help.

Let's see: 1G of highmem is ~250,000 pages, mapped by ~500 page tables.  
Well, then after 4000 scans we ought to have unshadowed everything.  So 
I guess per-page-pte-history is broken; I can't explain it otherwise.
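Making the estimate exact (a sketch assuming 4k pages and 512 8-byte entries per PAE page table page):

```c
/* 1G of highmem in 4k pages: 1073741824 / 4096 = 262144 (~250K). */
static unsigned long highmem_pages(unsigned long bytes)
{
	return bytes / 4096;
}

/* Each PAE page table page holds 512 pte entries, so ~500 page tables
 * are needed to map those pages. */
static unsigned long pae_page_tables(unsigned long pages)
{
	return (pages + 511) / 512;
}
```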

>> One possible optimization is that if we see the first part of the kmap 
>> instantiation, we emulate a few more instructions before returning to the 
>> guest.  Xen does this IIRC.
>>     
>
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is that important to do this. Most 32bit enterprise
> kernels I worked aren't compiled with PAE, only one called bigsmp is.
>
>   

Well, it seems the RHEL3 U8 smp kernel is PAE.

> Also on 2.6, we could get the same benefit by making 2.6 at least as
> optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
> only after setting it to a new value. Xen can't optimize that write in
> kunmap_atomic.
>
> 2.6 has debug enabled by default for no good reason. So that would be
> the first optimization to do as it saves a few cycles per
> kunmap_atomic on host too.
>
>   

Yes, it's probably a small win on native as well.

>> I'm no longer sure the access pattern is sequential, since I see 
>> kmap_atomic() will not recreate the pte if its value has not changed 
>> (unless HIGHMEM_DEBUG).
>>     
>
> Hmm kmap_atomic always writes a new value to the fixmap pte, even if
> it was mapping the same user pte as before.
>
> static inline void *kmap_atomic(struct page *page, enum km_type type)
> {
> 	enum fixed_addresses idx;
> 	unsigned long vaddr;
>
> 	if (page < highmem_start_page)
> 		return page_address(page);
>
> 	idx = type + KM_TYPE_NR*smp_processor_id();
> 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>     if (!pte_none(*(kmap_pte-idx)))
>        out_of_line_bug();
> #endif
> 	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
> 	__flush_tlb_one(vaddr);
>
> 	return (void*) vaddr;
> }
>
>   

The centos 3.8 sources have


static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#else
        /*
         * Performance optimization - do not flush if the new
         * pte is the same as the old one:
         */
        if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
                return (void *) vaddr;
#endif
        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
        __flush_tlb_one(vaddr);

        return (void*) vaddr;
}


(linux-2.4.21-47.EL)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-28 10:51                                                         ` Avi Kivity
  2008-05-28 14:13                                                           ` David S. Ahern
@ 2008-05-29 16:42                                                           ` David S. Ahern
  2008-05-31  8:16                                                             ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

[-- Attachment #1: Type: text/plain, Size: 5838 bytes --]



Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still see large system time hiccups in the
>> guests due to kscand in the guest scanning its active lists. I do see
>> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
>> completeness I also tried a history of 2, but it performed worse than 3
>> which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>>     memuser 768M 120 5 300
>>     memuser 384M 300 10 600
>>
>> Together these instances take up a 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>   
> 
> I haven't been able to reproduce this:
> 
>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root         7     1  1  75   0    -     0 schedu 10:07 ?       
>> 00:00:26 [kscand]
>> 0 S root      1464     1  1  75   0    - 196986 schedu 10:20 pts/0  
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root      1465     1  0  75   0    - 98683 schedu 10:20 pts/0   
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root      2148  1293  0  75   0    -   922 pipe_w 10:48 pts/0   
>> 00:00:00 grep -E memuser|kscand
> 
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant.  This is a 2GB guest running with my
> patch ported to kvm.git HEAD.  Guest is has 2G of memory.
> 
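The memuser attachment itself isn't reproduced in the archive; a rough sketch of the described workload — allocate, back the memory, then sweep it in chunks so aged pages keep becoming active again — could look like the following. The helper names and argument semantics are guesses, not the real program:

```c
#include <stdlib.h>
#include <string.h>

/* Parse a size argument like "768M" into bytes (hypothetical helper;
 * the real memuser's argument parsing is unknown). */
static size_t parse_size(const char *s)
{
	char *end;
	size_t n = (size_t)strtoul(s, &end, 10);

	if (*end == 'M' || *end == 'm')
		n *= 1024 * 1024;
	return n;
}

/* Allocate and back `size` bytes, then sweep through them one chunk at
 * a time, touching one byte per 4k page; returns the number of chunks.
 * The real program reportedly sleeps between chunks and loops forever,
 * which is what periodically re-activates the aged highmem pages. */
static size_t memuser_sweep(size_t size, size_t chunk)
{
	char *buf = malloc(size);
	size_t off, p, chunks = 0;

	if (!buf)
		return 0;
	memset(buf, 1, size);			/* get the allocation backed */
	for (off = 0; off < size; off += chunk) {
		size_t end = off + chunk < size ? off + chunk : size;

		for (p = off; p < end; p += 4096)
			buf[p]++;		/* touch one byte per page */
		chunks++;
	}
	free(buf);
	return chunks;
}
```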

I'm running on the per-page-pte-tracking branch, and I am still seeing it. 

I doubt you want to sit and watch the screen for an hour, so install sysstat if it isn't already installed, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours and then run 'sar -u'. You'll see something like this:

10:12:11 AM       LINUX RESTART

10:13:03 AM       CPU     %user     %nice   %system   %iowait     %idle
10:14:01 AM       all      0.08      0.00      2.08      0.35     97.49
10:15:03 AM       all      0.05      0.00      0.79      0.04     99.12
10:15:59 AM       all      0.15      0.00      1.52      0.06     98.27
10:17:01 AM       all      0.04      0.00      0.69      0.04     99.23
10:17:59 AM       all      0.01      0.00      0.39      0.00     99.60
10:18:59 AM       all      0.00      0.00      0.12      0.02     99.87
10:20:02 AM       all      0.18      0.00     14.62      0.09     85.10
10:21:01 AM       all      0.71      0.00     26.35      0.01     72.94
10:22:02 AM       all      0.67      0.00     10.61      0.00     88.72
10:22:59 AM       all      0.14      0.00      1.80      0.00     98.06
10:24:03 AM       all      0.13      0.00      0.50      0.00     99.37
10:24:59 AM       all      0.09      0.00     11.46      0.00     88.45
10:26:03 AM       all      0.16      0.00      0.69      0.03     99.12
10:26:59 AM       all      0.14      0.00     10.01      0.02     89.83
10:28:03 AM       all      0.57      0.00      2.20      0.03     97.20
Average:          all      0.21      0.00      5.55      0.05     94.20


every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running the guest %system time is <1%. The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. An intern here added a timestamp to the kvm_stat output for me which helps to directly correlate guest/host data.


I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one which happened after 15 minutes of uptime:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59

active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103

active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212

The kvm_stat data for this time period is attached due to line lengths.


Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list:

            if (need_active_cache_scan(zone)) {
                for (age = MAX_AGE-1; age >= 0; age--)  {
                    scan_active_list(zone, age,
                        &zone->active_cache_list[age],
                        zone->active_anon_count[age]);
                              ^^^^^^^^^^^^^^^^^
                    if (current->need_resched)
                        schedule();
                }
            }

When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here:

active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3

count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest.
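Assuming scan_active_list bounds its walk by the count argument (which is what the trace above suggests), the over-scan caused by the bug can be quantified with a small helper (name invented):

```c
/* With only 222 entries on the cache list but active_anon_count[age] ==
 * 111967 passed as the scan count, the short list is re-walked roughly
 * count / list_len times -- about 504 times in the captured example. */
static unsigned long rescan_factor(unsigned long list_len,
				   unsigned long count_passed)
{
	return list_len ? count_passed / list_len : 0;
}
```

The guest-side fix would presumably be to pass the cache list's own count instead — something like zone->active_cache_count[age], with that field name inferred by symmetry with active_anon_count rather than taken from the RHEL3 source.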

david

[-- Attachment #2: kvm_stat.kscand --]
[-- Type: text/plain, Size: 2650 bytes --]

kvm-69/kvm_stat -f 'mmu*|pf*' -l:

 mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  mmu_pte_w  mmu_recyc  mmu_shado   pf_fixed   pf_guest
       182         18         18          0       5664       5682          0         18       5720         21
       211         59         59          0       7040       7105          0         59       7348         99
        81          0         48          0      45861      45909          0         48      45910          1
       209        683        814          0     178527     179405          0        814     181410          9
        67        111        320          0     175602     175922          0        320     177202          0
        28          0         29          0     181365     181394          0         29     181394          0
         7          0         22          0     181834     181856          0         22     181855          0
        35          0         14          0     180129     180143          0         14     180144          0
         7          0         10          0     179141     179151          0         10     179150          0
        35          0          3          0     181359     181361          0          3     181362          0
         7          0          4          0     181565     181570          0          4     181570          0
        21          0          3          0     181435     181437          0          3     181437          0
        21          0          4          0     181281     181286          0          4     181285          0
        21          0          3          0     179444     179447          0          3     179448          0
        91          0         61          0     179841     179902          0         61     179902          0
         7          0        247          0     176628     176875          0        247     176874          0
       313        478        133          1     100486     100604          0        133     126690         80
       162         21         18          0       6361       6379          0         18       6584          5
       294         40         23         21       9144       9188          0         25       9544         45
       143          5          1          0       5026       5027          0          1       5502          1


The above corresponds to the following from the guest:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59

active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103

active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 15:16                                                                         ` Avi Kivity
@ 2008-05-30 13:12                                                                           ` Andrea Arcangeli
  2008-05-31  7:39                                                                             ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-30 13:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: David S. Ahern, kvm

On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
> Yes.  We need a fault in order to set the guest accessed bit.

So what I'm missing now is how the spte corresponding to the user pte
that is under test_and_clear to clear the accessed bit, will not be
zapped immediately. If we don't zap it immediately, how do we set the
accessed bit again on the user pte, when the user program returned
running and used that shadow pte to access the program data after the
kscand pass?

Or am I missing something?

> Unshadowing a page is expensive, both in immediate cost, and in future cost 
> of reshadowing the page and taking faults.  It's worthwhile to be sure the 
> guest really doesn't want it as a page table.

Ok that makes sense, but can we defer the unshadowing while still
emulating the accessed bit correctly on the user pte?

> If the pages are not scanned linearly, then unshadowing may not help.

It should help the second time kscand runs, for the user ptes that
aren't shadowed anymore, the second pass won't require any emulation
for test_and_clear because the spte of the fixmap area will be
read-write. The bug that passes the anonymous pages number instead of
the cache number will lead to many more test_and_clear than needed,
and not all user ptes may be used in between two different kscand passes.

> Let's see 1G of highmem is 250,000 pages, mapped by 500 pages tables.  

There are likely 1500 ptes in highmem. (ram isn't the most important factor)

> Well, then after 4000 scans we ought to have unshadowed everything.  So I 
> guess per-page-pte-history is broken, can't explain it otherwise.

Yes, we should have unshadowed all user ptes after 4000 scans and then
the test_and_clear shouldn't require any more emulation, there will be
only 3 emulations for each kmap_atomic/kunmap_atomic.

> static inline void *__kmap_atomic(struct page *page, enum km_type type)
> {
>        enum fixed_addresses idx;
>        unsigned long vaddr;
>
>        idx = type + KM_TYPE_NR*smp_processor_id();
>        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>        if (!pte_none(*(kmap_pte-idx)))
>                out_of_line_bug();
> #else
>        /*
>         * Performance optimization - do not flush if the new
>         * pte is the same as the old one:
>         */
>        if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
>                return (void *) vaddr;
> #endif
>        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
>        __flush_tlb_one(vaddr);
>
>        return (void*) vaddr;
> }

It's weird they optimized this if they enabled
CONFIG_HIGHMEM_DEBUG. Anyway it doesn't make a whole lot of difference
as it's an unlikely condition.

> (linux-2.4.21-47.EL)

Downloaded it now.

I think it should be clear that by now, we're trying to be
bug-compatible like the host here, and optimizing for 2.6 kmaps.


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-30 13:12                                                                           ` Andrea Arcangeli
@ 2008-05-31  7:39                                                                             ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-31  7:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
>   
>> Yes.  We need a fault in order to set the guest accessed bit.
>>     
>
> So what I'm missing now is how the spte corresponding to the user pte
> that is under test_and_clear to clear the accessed bit, will not be
> zapped immediately. If we don't zap it immediately, how do we set the
> accessed bit again on the user pte, when the user program returned
> running and used that shadow pte to access the program data after the
> kscand pass?
>
>   

The spte is zapped unconditionally in kvm_mmu_pte_write(), and not 
re-established in mmu_pte_write_new_pte() due to the missing accessed bit.

The question is whether to tear down the shadow page it is contained in, 
or not.

> Or am I missing something?
>
>   
>> Unshadowing a page is expensive, both in immediate cost, and in future cost 
>> of reshadowing the page and taking faults.  It's worthwhile to be sure the 
>> guest really doesn't want it as a page table.
>>     
>
> Ok that makes sense, but can we defer the unshadowing while still
> emulating the accessed bit correctly on the user pte?
>
>   

We do, unless there's a bad bug somewhere.

>> If the pages are not scanned linearly, then unshadowing may not help.
>>     
>
> It should help the second time kscand runs, for the user ptes that
> aren't shadowed anymore, the second pass won't require any emulation
> for test_and_clear because the spte of the fixmap area will be
> read-write. The bug that passes the anonymous pages number instead of
> the cache number will lead to many more test_and_clear than needed,
> and not all user ptes may be used in between two different kscand passes.
>
>   

We still need 3 emulations per pte to set the fixmap entry.  Unshadowing 
saves one emulation on the pte itself.


>> Let's see 1G of highmem is 250,000 pages, mapped by 500 pages tables.  
>>     
>
> There are likely 1500 ptes in highmem. (ram isn't the most important factor)
>
>   

I use 'pte' in the Intel manual sense (page table entry), not the Linux 
sense (page table).

I mentioned these numbers to see the worst case behavior.

Non-highmem:

   - with unshadow: O(500) accesses to unshadow the page tables, then 
native speed
   - without unshadow: O(250000) accesses to modify the ptes

Highmem:
   - with unshadow: O(250000) accesses to update the fixmap entry
   - without unshadow: O(250000) accesses to update the fixmap entry and to 
modify the ptes
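The sizing behind those figures can be sanity-checked (a sketch assuming 4KB pages and PAE page tables with 512 entries each, which matches the O(500) figure):

```c
#include <assert.h>

/* 1G of highmem split into 4KB pages, mapped by PAE page tables
 * holding 512 PTEs each. */
static unsigned long highmem_pages(unsigned long bytes)
{
    return bytes / 4096;
}

static unsigned long highmem_page_tables(unsigned long bytes)
{
    return highmem_pages(bytes) / 512;
}
```

With non-PAE 1024-entry page tables the table count would halve, but the order of magnitude of the with/without-unshadow comparison is unchanged.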
 

>> Well, then after 4000 scans we ought to have unshadowed everything.  So I 
>> guess per-page-pte-history is broken, can't explain it otherwise.
>>     
>
> Yes, we should have unshadowed all user ptes after 4000 scans and then
> the test_and_clear shouldn't require any more emulation, there will be
> only 3 emulations for each kmap_atomic/kunmap_atomic.
>
>   

So we save 25%.  It's still bad even if everything is working correctly.
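The 25% follows from the per-pte accounting above: four emulated accesses per scanned pte with the fixmap shadowed (three to set the fixmap entry plus one on the pte itself), three once the page table is unshadowed. A trivial check:

```c
#include <assert.h>

/* Percentage of emulated accesses saved by unshadowing, per scanned
 * pte: 4 emulations (3 fixmap + 1 pte write) drop to 3 (fixmap only). */
static int pct_saved(int before, int after)
{
    return (before - after) * 100 / before;
}
```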

>
> I think it should be clear that by now, we're trying to be
> bug-compatible like the host here, and optimizing for 2.6 kmaps.
>   

Don't understand.


I'm guessing esx gets its good performance by special-casing something.  
For example, they can keep the fixmap page never shadowed, always 
emulate accesses through the fixmap page, and recompile instructions 
that go through fixmap to issue a hypercall.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-29 16:42                                                           ` David S. Ahern
@ 2008-05-31  8:16                                                             ` Avi Kivity
  2008-06-02 16:42                                                               ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-31  8:16 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
>> I haven't been able to reproduce this:
>>
>>     
>>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>>> 1 S root         7     1  1  75   0    -     0 schedu 10:07 ?       
>>> 00:00:26 [kscand]
>>> 0 S root      1464     1  1  75   0    - 196986 schedu 10:20 pts/0  
>>> 00:00:21 ./memuser 768M 120 5 300
>>> 0 S root      1465     1  0  75   0    - 98683 schedu 10:20 pts/0   
>>> 00:00:10 ./memuser 384M 300 10 600
>>> 0 S root      2148  1293  0  75   0    -   922 pipe_w 10:48 pts/0   
>>> 00:00:00 grep -E memuser|kscand
>>>       
>> The workload has been running for about half an hour, and kswapd cpu
>> usage doesn't seem significant.  This is a 2GB guest running with my
>> patch ported to kvm.git HEAD.  Guest has 2G of memory.
>>
>>     
>
> I'm running on the per-page-pte-tracking branch, and I am still seeing it. 
>
> I doubt you want to sit and watch the screen for an hour, so install sysstat if not already, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours and then run 'sar -u'. You'll see something like this:
>
> 10:12:11 AM       LINUX RESTART
>
> 10:13:03 AM       CPU     %user     %nice   %system   %iowait     %idle
> 10:14:01 AM       all      0.08      0.00      2.08      0.35     97.49
> 10:15:03 AM       all      0.05      0.00      0.79      0.04     99.12
> 10:15:59 AM       all      0.15      0.00      1.52      0.06     98.27
> 10:17:01 AM       all      0.04      0.00      0.69      0.04     99.23
> 10:17:59 AM       all      0.01      0.00      0.39      0.00     99.60
> 10:18:59 AM       all      0.00      0.00      0.12      0.02     99.87
> 10:20:02 AM       all      0.18      0.00     14.62      0.09     85.10
> 10:21:01 AM       all      0.71      0.00     26.35      0.01     72.94
> 10:22:02 AM       all      0.67      0.00     10.61      0.00     88.72
> 10:22:59 AM       all      0.14      0.00      1.80      0.00     98.06
> 10:24:03 AM       all      0.13      0.00      0.50      0.00     99.37
> 10:24:59 AM       all      0.09      0.00     11.46      0.00     88.45
> 10:26:03 AM       all      0.16      0.00      0.69      0.03     99.12
> 10:26:59 AM       all      0.14      0.00     10.01      0.02     89.83
> 10:28:03 AM       all      0.57      0.00      2.20      0.03     97.20
> Average:          all      0.21      0.00      5.55      0.05     94.20
>
>
> every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running the guest %system time is <1%. The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. An intern here added a timestamp to the kvm_stat output for me which helps to directly correlate guest/host data.
>
>
> I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one which happened after 15 minutes of uptime:
>
> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
>
> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
>
> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212
>
>   

We touched 90,000 ptes in 12 seconds.  That's 8,000 ptes per second.  
Yet we see 180,000 page faults per second in the trace.

Oh!  Only 45K pages were direct, so the other 45K were shared, with 
perhaps many ptes.  We should count ptes, not pages.

Can you modify page_referenced() to count the numbers of ptes mapped (1 
for direct pages, nr_chains for indirect pages) and print the total 
deltas in active_anon_scan?

> The kvm_stat data for this time period is attached due to line lengths.
>
>
> Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list:
>
>             if (need_active_cache_scan(zone)) {
>                 for (age = MAX_AGE-1; age >= 0; age--)  {
>                     scan_active_list(zone, age,
>                         &zone->active_cache_list[age],
>                         zone->active_anon_count[age]);
>                               ^^^^^^^^^^^^^^^^^
>                     if (current->need_resched)
>                         schedule();
>                 }
>             }
>
> When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here:
>
> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3
>
> count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest.
>   

For debugging, can you fix it?  It certainly has a large impact.

Perhaps it is fixed in an update kernel.  There's a 2.4.21-50.EL in the 
centos 3.8 update repos.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-05-31  8:16                                                             ` Avi Kivity
@ 2008-06-02 16:42                                                               ` David S. Ahern
  2008-06-05  8:37                                                                 ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-02 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm


Avi Kivity wrote:
> David S. Ahern wrote:
>>> I haven't been able to reproduce this:
>>>
>>>    
>>>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>>>> 1 S root         7     1  1  75   0    -     0 schedu 10:07 ?      
>>>> 00:00:26 [kscand]
>>>> 0 S root      1464     1  1  75   0    - 196986 schedu 10:20 pts/0 
>>>> 00:00:21 ./memuser 768M 120 5 300
>>>> 0 S root      1465     1  0  75   0    - 98683 schedu 10:20 pts/0  
>>>> 00:00:10 ./memuser 384M 300 10 600
>>>> 0 S root      2148  1293  0  75   0    -   922 pipe_w 10:48 pts/0  
>>>> 00:00:00 grep -E memuser|kscand
>>>>       
>>> The workload has been running for about half an hour, and kswapd cpu
>>> usage doesn't seem significant.  This is a 2GB guest running with my
>>> patch ported to kvm.git HEAD.  Guest has 2G of memory.
>>>
>>>     
>>
>> I'm running on the per-page-pte-tracking branch, and I am still seeing
>> it.
>> I doubt you want to sit and watch the screen for an hour, so install
>> sysstat if not already, change the sample rate to 1 minute
>> (/etc/cron.d/sysstat), let the server run for a few hours and then run
>> 'sar -u'. You'll see something like this:
>>
>> 10:12:11 AM       LINUX RESTART
>>
>> 10:13:03 AM       CPU     %user     %nice   %system   %iowait     %idle
>> 10:14:01 AM       all      0.08      0.00      2.08      0.35     97.49
>> 10:15:03 AM       all      0.05      0.00      0.79      0.04     99.12
>> 10:15:59 AM       all      0.15      0.00      1.52      0.06     98.27
>> 10:17:01 AM       all      0.04      0.00      0.69      0.04     99.23
>> 10:17:59 AM       all      0.01      0.00      0.39      0.00     99.60
>> 10:18:59 AM       all      0.00      0.00      0.12      0.02     99.87
>> 10:20:02 AM       all      0.18      0.00     14.62      0.09     85.10
>> 10:21:01 AM       all      0.71      0.00     26.35      0.01     72.94
>> 10:22:02 AM       all      0.67      0.00     10.61      0.00     88.72
>> 10:22:59 AM       all      0.14      0.00      1.80      0.00     98.06
>> 10:24:03 AM       all      0.13      0.00      0.50      0.00     99.37
>> 10:24:59 AM       all      0.09      0.00     11.46      0.00     88.45
>> 10:26:03 AM       all      0.16      0.00      0.69      0.03     99.12
>> 10:26:59 AM       all      0.14      0.00     10.01      0.02     89.83
>> 10:28:03 AM       all      0.57      0.00      2.20      0.03     97.20
>> Average:          all      0.21      0.00      5.55      0.05     94.20
>>
>>
>> every one of those jumps in %system time directly correlates to kscand
>> activity. Without the memuser programs running the guest %system time
>> is <1%. The point of this silly memuser program is just to use high
>> memory -- let it age, then make it active again, sit idle, repeat. If
>> you run kvm_stat with -l in the host you'll see the jump in pte
>> writes/updates. An intern here added a timestamp to the kvm_stat
>> output for me which helps to directly correlate guest/host data.
>>
>>
>> I also ran my real guest on the branch. Performance at boot through
>> the first 15 minutes was much better, but I'm still seeing recurring
>> hits every 5 minutes when kscand kicks in. Here's the data from the
>> guest for the first one which happened after 15 minutes of uptime:
>>
>> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct
>> 24845, dj 59
>>
>> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct
>> 40868, dj 103
>>
>> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct
>> 45805, dj 1212
>>
>>   
> 
> We touched 90,000 ptes in 12 seconds.  That's 8,000 ptes per second. 
> Yet we see 180,000 page faults per second in the trace.
> 
> Oh!  Only 45K pages were direct, so the other 45K were shared, with
> perhaps many ptes. We should count ptes, not pages.
> 
> Can you modify page_referenced() to count the numbers of ptes mapped (1
> for direct pages, nr_chains for indirect pages) and print the total
> deltas in active_anon_scan?
> 

Here you go. I've shortened the line lengths to get them to squeeze into
80 columns:

anon_scan, all HighMem zone, 187,910 active pages at loop start:
  count[12] 21462 -> 230,   direct 20469, chains 3479,   dj 58
  count[11] 1338  -> 1162,  direct 227,   chains 26144,  dj 59
  count[8] 29397  -> 5410,  direct 26115, chains 27617,  dj 117
  count[4] 35804  -> 25556, direct 31508, chains 82929,  dj 256
  count[3] 2738   -> 2207,  direct 2680,  chains 58,     dj 7
  count[0] 92580  -> 89509, direct 75024, chains 262834, dj 726
(age number is the index in [])

cache_scan, all HighMem zone, 48,298 active pages at loop start:
  count[12] 3642 -> 2982,  direct 499,  chains 20022, dj 44
  count[8] 11254 -> 11187, direct 7189, chains 9854,  dj 37
  count[4] 15709 -> 15702, direct 5071, chains 9388,  dj 31
(with anon_cache_count bug fixed)

If you sum the direct pages and the chains count for each row and convert
dj into dt (dividing by HZ = 100), you get:

( 20469 + 3479 )   / 0.58 = 41289
( 227 + 26144 )    / 0.59 = 44696
( 26115 + 27617 )  / 1.17 = 45924
( 31508 + 82929 )  / 2.56 = 44701
( 2680 + 58 )      / 0.07 = 39114
( 75024 + 262834 ) / 7.26 = 46536
( 499 + 20022 )    / 0.44 = 46638
( 7189 + 9854 )    / 0.37 = 46062
( 5071 + 9388 )    / 0.31 = 46641

At 4 pte writes per direct page or chain entry, that comes to
~187,000/sec, which is close to the total collected by kvm_stat (data
width shrunk to fit in e-mail; hope this is still readable):


|----------         mmu_          ----------|-----  pf_  -----|
 cache  flood  pde_z    pte_u    pte_w  shado    fixed    guest
   267    271     95    21455    21842    285    22840      165
    66     88      0    12102    12224     88    12458        0
  2042   2133      0   178146   180515   2133   188089      387
  1053   1212      0   187067   188485   1212   193011        8
  4771   4811     88   185129   190998   4825   207490      448
   910    824      7   183066   184050    824   195836       12
   707    785      0   176381   177300    785   180350        6
  1167   1144      0   189618   191014   1144   195902       10
  4238   4193     87   188381   193590   4206   207030      465
  1448   1400      7   187786   189509   1400   198688       21
   982    971      0   187880   189076    971   198405        2
  1165   1208      0   190007   191503   1208   195746       13
  1106   1146      0   189144   190550   1146   195143        0
  4767   4788     96   185802   191704   4802   206362      477
  1388   1431      0   187387   188991   1431   195115        3
   584    551      0    77176    77802    551    84829       10
    12      7      0     3601     3609      7    13497        4
   243    153     91    31085    31333    167    35059      879
    21     18      6     3130     3155     18     3827        2
    21      4      1     4665     4670      4     6825        9
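The per-row rate computation above can be spot-checked (a sketch; HZ = 100 as in the conversion described earlier):

```c
#include <assert.h>

/* Entries (direct pages + chain entries) processed per second, where
 * `dj` is the elapsed time in jiffies and HZ = 100. */
static unsigned long entries_per_sec(unsigned long direct,
                                     unsigned long chains,
                                     unsigned long dj)
{
    return (direct + chains) * 100 / dj;
}
```

For the last cache_scan row this reproduces the ~46K entries/sec figure, and at 4 pte writes per entry that is ~187K/sec, in line with the pte_w column of the kvm_stat data.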

>> The kvm_stat data for this time period is attached due to line lengths.
>>
>>
>> Also, I forgot to mention this before, but there is a bug in the
>> kscand code in the RHEL3U8 kernel. When it scans the cache list it
>> uses the count from the anonymous list:
>>
>>             if (need_active_cache_scan(zone)) {
>>                 for (age = MAX_AGE-1; age >= 0; age--)  {
>>                     scan_active_list(zone, age,
>>                         &zone->active_cache_list[age],
>>                         zone->active_anon_count[age]);
>>                               ^^^^^^^^^^^^^^^^^
>>                     if (current->need_resched)
>>                         schedule();
>>                 }
>>             }
>>
>> When the anonymous count is higher it is scanning the cache list
>> repeatedly. An example of that was captured here:
>>
>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon
>> 111967, direct 626, dj 3
>>
>> count anon is active_anon_count[age] which at this moment was 111,967.
>> There were only 222 entries in the cache list, but the count value
>> passed to scan_active_list was 111,967. When the cache list has a lot
>> of direct pages, that causes a larger hit on kvm than needed. That
>> said, I have to live with the bug in the guest.
>>   
> 
> For debugging, can you fix it?  It certainly has a large impact.
> 
Yes, I have run a few tests with it fixed to get a ballpark on the
impact. The fix is included in the numbers above.

> Perhaps it is fixed in an update kernel.  There's a 2.4.21-50.EL in the
> centos 3.8 update repos.
> 


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-02 16:42                                                               ` David S. Ahern
@ 2008-06-05  8:37                                                                 ` Avi Kivity
  2008-06-05 16:20                                                                   ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-05  8:37 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
>> Oh!  Only 45K pages were direct, so the other 45K were shared, with
>> perhaps many ptes.  We should count ptes, not pages.
>>
>> Can you modify page_referenced() to count the numbers of ptes mapped (1
>> for direct pages, nr_chains for indirect pages) and print the total
>> deltas in active_anon_scan?
>>
>>     
>
> Here you go. I've shortened the line lengths to get them to squeeze into
> 80 columns:
>
> anon_scan, all HighMem zone, 187,910 active pages at loop start:
>   count[12] 21462 -> 230,   direct 20469, chains 3479,   dj 58
>   count[11] 1338  -> 1162,  direct 227,   chains 26144,  dj 59
>   count[8] 29397  -> 5410,  direct 26115, chains 27617,  dj 117
>   count[4] 35804  -> 25556, direct 31508, chains 82929,  dj 256
>   count[3] 2738   -> 2207,  direct 2680,  chains 58,     dj 7
>   count[0] 92580  -> 89509, direct 75024, chains 262834, dj 726
> (age number is the index in [])
>
>   

Where do all those ptes come from?  That's 180K pages (most of highmem), 
but with 550K ptes.

The memuser workload doesn't use fork(), so there shouldn't be any 
indirect ptes.

We might try to unshadow the fixmap page; that means we don't have to do 
4 fixmap pte accesses per pte scanned.

The kernel uses two methods for clearing the accessed bit:

For direct pages:

                if (pte_young(*pte) && ptep_test_and_clear_young(pte))
                        referenced++;

(two accesses)

For indirect pages:

                                if (ptep_test_and_clear_young(pte))
                                        referenced++;

(one access)

Both have to be emulated if we don't shadow the fixmap.  With your 
numbers above, that translates to 700K emulations vs 2200K emulations, 
a 3X improvement.  I'm not sure it will be sufficient 
given that we're reducing a 10-second kscand scan into a 3-second scan.
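The 700K-vs-2200K estimate can be reproduced from the anon_scan rows quoted above (a sketch; it assumes 2 emulated accesses per direct pte and 1 per chain entry when the fixmap is unshadowed, vs 4 emulated fixmap pte writes per pte scanned when it is shadowed):

```c
#include <assert.h>

/* The direct and chains columns of the anon_scan data quoted above. */
static const unsigned long anon_direct[6] = {
    20469, 227, 26115, 31508, 2680, 75024
};
static const unsigned long anon_chains[6] = {
    3479, 26144, 27617, 82929, 58, 262834
};

static unsigned long sum(const unsigned long *v, int n)
{
    unsigned long s = 0;
    int i;

    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Fixmap unshadowed: the accessed-bit ops themselves are emulated --
 * two accesses per direct pte, one per chain entry. */
static unsigned long emulations_unshadowed(void)
{
    return 2 * sum(anon_direct, 6) + sum(anon_chains, 6);
}

/* Fixmap shadowed: 4 emulated fixmap pte writes per pte scanned. */
static unsigned long emulations_shadowed(void)
{
    return 4 * (sum(anon_direct, 6) + sum(anon_chains, 6));
}
```

Summing the columns gives ~715K emulations without fixmap shadowing vs ~2236K with it, i.e. the quoted 3X.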

> If you sum the direct pages and the chains count for each row, convert
> dj into dt (divided by HZ = 100) you get:
>
> ( 20469 + 3479 )   / 0.58 = 41289
> ( 227 + 26144 )    / 0.59 = 44696
> ( 26115 + 27617 )  / 1.17 = 45924
> ( 31508 + 82929 )  / 2.56 = 44701
> ( 2680 + 58 )      / 0.07 = 39114
> ( 75024 + 262834 ) / 7.26 = 46536
> ( 499 + 20022 )    / 0.44 = 46638
> ( 7189 + 9854 )    / 0.37 = 46062
> ( 5071 + 9388 )    / 0.31 = 46641
>
> For 4 pte writes per direct page or chain entry comes to ~187,000/sec
> which is close to the total collected by kvm_stat (data width shrunk to
> fit in e-mail; hope this is readable still):
>
>
> |----------         mmu_          ----------|-----  pf_  -----|
>  cache  flood  pde_z    pte_u    pte_w  shado    fixed    guest
>    267    271     95    21455    21842    285    22840      165
>     66     88      0    12102    12224     88    12458        0
>   2042   2133      0   178146   180515   2133   188089      387
>   1053   1212      0   187067   188485   1212   193011        8
>   4771   4811     88   185129   190998   4825   207490      448
>    910    824      7   183066   184050    824   195836       12
>    707    785      0   176381   177300    785   180350        6
>   1167   1144      0   189618   191014   1144   195902       10
>   4238   4193     87   188381   193590   4206   207030      465
>   1448   1400      7   187786   189509   1400   198688       21
>    982    971      0   187880   189076    971   198405        2
>   1165   1208      0   190007   191503   1208   195746       13
>   1106   1146      0   189144   190550   1146   195143        0
>   4767   4788     96   185802   191704   4802   206362      477
>   1388   1431      0   187387   188991   1431   195115        3
>    584    551      0    77176    77802    551    84829       10
>     12      7      0     3601     3609      7    13497        4
>    243    153     91    31085    31333    167    35059      879
>     21     18      6     3130     3155     18     3827        2
>     21      4      1     4665     4670      4     6825        9
>
>   
>>> The kvm_stat data for this time period is attached due to line lengths.
>>>
>>>
>>> Also, I forgot to mention this before, but there is a bug in the
>>> kscand code in the RHEL3U8 kernel. When it scans the cache list it
>>> uses the count from the anonymous list:
>>>
>>>             if (need_active_cache_scan(zone)) {
>>>                 for (age = MAX_AGE-1; age >= 0; age--)  {
>>>                     scan_active_list(zone, age,
>>>                         &zone->active_cache_list[age],
>>>                         zone->active_anon_count[age]);
>>>                               ^^^^^^^^^^^^^^^^^
>>>                     if (current->need_resched)
>>>                         schedule();
>>>                 }
>>>             }
>>>
>>> When the anonymous count is higher it is scanning the cache list
>>> repeatedly. An example of that was captured here:
>>>
>>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon
>>> 111967, direct 626, dj 3
>>>
>>> count anon is active_anon_count[age] which at this moment was 111,967.
>>> There were only 222 entries in the cache list, but the count value
>>> passed to scan_active_list was 111,967. When the cache list has a lot
>>> of direct pages, that causes a larger hit on kvm than needed. That
>>> said, I have to live with the bug in the guest.
>>>   
>>>       
>> For debugging, can you fix it?  It certainly has a large impact.
>>
>>     
> yes, I have run a few tests with it fixed to get a ballpark on the
> impact. The fix is included in the number above.
>
>   
>> Perhaps it is fixed in an update kernel.  There's a 2.4.21-50.EL in the
>> centos 3.8 update repos.
>>
>>     

It seems to have been fixed there.

-- 
error compiling committee.c: too many arguments to function



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-05  8:37                                                                 ` Avi Kivity
@ 2008-06-05 16:20                                                                   ` David S. Ahern
  2008-06-06 16:40                                                                     ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-05 16:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm



Avi Kivity wrote:
> David S. Ahern wrote:
>>> Oh!  Only 45K pages were direct, so the other 45K were shared, with
>>> perhaps many ptes.  We should count ptes, not pages.
>>>
>>> Can you modify page_referenced() to count the numbers of ptes mapped (1
>>> for direct pages, nr_chains for indirect pages) and print the total
>>> deltas in active_anon_scan?
>>>
>>>     
>>
>> Here you go. I've shortened the line lengths to get them to squeeze into
>> 80 columns:
>>
>> anon_scan, all HighMem zone, 187,910 active pages at loop start:
>>   count[12] 21462 -> 230,   direct 20469, chains 3479,   dj 58
>>   count[11] 1338  -> 1162,  direct 227,   chains 26144,  dj 59
>>   count[8] 29397  -> 5410,  direct 26115, chains 27617,  dj 117
>>   count[4] 35804  -> 25556, direct 31508, chains 82929,  dj 256
>>   count[3] 2738   -> 2207,  direct 2680,  chains 58,     dj 7
>>   count[0] 92580  -> 89509, direct 75024, chains 262834, dj 726
>> (age number is the index in [])
>>
>>   
> 
> Where do all those ptes come from?  That's 180K pages (most of highmem),
> but with 550K ptes.
> 
> The memuser workload doesn't use fork(), so there shouldn't be any
> indirect ptes.
> 
> We might try to unshadow the fixmap page; that means we don't have to do
> 4 fixmap pte accesses per pte scanned.
> 
> The kernel uses two methods for clearing the accessed bit:
> 
> For direct pages:
> 
>                if (pte_young(*pte) && ptep_test_and_clear_young(pte))
>                        referenced++;
> 
> (two accesses)
> 
> For indirect pages:
> 
>                                if (ptep_test_and_clear_young(pte))
>                                        referenced++;
> 
> (one access)
> 
> Which have to be emulated if we don't shadow the fixmap.  With the data
> above, that translates to 700K emulations with your numbers above, vs
> 2200K emulations, a 3X improvement.  I'm not sure it will be sufficient
> given that we're reducing a 10-second kscand scan into a 3-second scan.
> 

A 3-second scan is much better, and in comparison to where kvm was when
I started this e-mail thread (as high as 30 seconds for a scan) it's a
10-fold improvement.

I gave a shot at implementing your suggestion, but evidently I am still
not understanding the shadow implementation. Can you suggest a patch to
try this out?

david


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-05 16:20                                                                   ` David S. Ahern
@ 2008-06-06 16:40                                                                     ` Avi Kivity
  2008-06-19  4:20                                                                       ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-06 16:40 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
> I gave a shot at implementing your suggestion, but evidently I am still
> not understanding the shadow implementation. Can you suggest a patch to
> try this out?
>   

We can have a hacking session in kvm forum.  Bring a guest on your laptop.

It isn't going to be easy to fix the problem without introducing a 
regression somewhere else.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-06 16:40                                                                     ` Avi Kivity
@ 2008-06-19  4:20                                                                       ` David S. Ahern
  2008-06-22  6:34                                                                         ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-19  4:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Avi:

We did not get a chance to do this at the Forum. I'd be interested in
whatever options you have for reducing the scan time further (e.g.,
trying to get the scan time down to 1-2 seconds).

thanks,

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I gave a shot at implementing your suggestion, but evidently I am still
>> not understanding the shadow implementation. Can you suggest a patch to
>> try this out?
>>   
> 
> We can have a hacking session in kvm forum.  Bring a guest on your laptop.
> 
> It isn't going to be easy to both fix the problem and also not introduce
> a regression somewhere else.
> 


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-19  4:20                                                                       ` David S. Ahern
@ 2008-06-22  6:34                                                                         ` Avi Kivity
  2008-06-23 14:09                                                                           ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-22  6:34 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm


David S. Ahern wrote:
> Avi:
>
> We did not get a chance to do this at the Forum. I'd be interested in
> whatever options you have for reducing the scan time further (e.g., try
> to get scan time down to 1-2 seconds).
>
>   

I'm unlikely to get time to do this properly for at least a week, as 
this will be quite difficult and I'm already horribly backlogged.  
However, there's an alternative option: modifying the guest source and 
getting it upstreamed, as I think RHEL3 is still maintained.

The attached patch (untested) should give a 3X boost for kmap_atomics, 
by folding the two accesses to set the pte into one, and by dropping the 
access that clears the pte.  Unfortunately it breaks the ABI, since 
external modules will inline the original kmap_atomic() which expects 
the pte to be cleared.

This can be worked around by allocating new fixmap slots for kmap_atomic 
with the new behavior, and keeping the old slots with the old behavior, 
but we should first see if the maintainers are open to performance 
optimizations targeting kvm.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


[-- Attachment #2: faster-2.4-kmap_atomic.patch --]
[-- Type: text/x-patch, Size: 1057 bytes --]

--- include/asm-i386/atomic_kmap.h.orig	2007-06-12 00:24:29.000000000 +0300
+++ include/asm-i386/atomic_kmap.h	2008-06-22 09:23:26.000000000 +0300
@@ -51,18 +51,13 @@ static inline void *__kmap_atomic(struct
 
 	idx = type + KM_TYPE_NR*smp_processor_id();
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#if HIGHMEM_DEBUG
-	if (!pte_none(*(kmap_pte-idx)))
-		out_of_line_bug();
-#else
 	/*
 	 * Performance optimization - do not flush if the new
 	 * pte is the same as the old one:
 	 */
 	if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
 		return (void *) vaddr;
-#endif
-	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+	set_pte_atomic(kmap_pte-idx, mk_pte(page, kmap_prot));
 	__flush_tlb_one(vaddr);
 
 	return (void*) vaddr;
@@ -77,12 +72,6 @@ static inline void __kunmap_atomic(void 
 	if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx))
 		out_of_line_bug();
 
-	/*
-	 * force other mappings to Oops if they'll try to access
-	 * this pte without first remap it
-	 */
-	pte_clear(kmap_pte-idx);
-	__flush_tlb_one(vaddr);
 #endif
 }
 


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-22  6:34                                                                         ` Avi Kivity
@ 2008-06-23 14:09                                                                           ` David S. Ahern
  2008-06-25  9:51                                                                             ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-23 14:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Avi Kivity wrote:
> David S. Ahern wrote:
>> Avi:
>>
>> We did not get a chance to do this at the Forum. I'd be interested in
>> whatever options you have for reducing the scan time further (e.g., try
>> to get scan time down to 1-2 seconds).
>>
>>   
> 
> I'm unlikely to get time to do this properly for at least a week, as
> this will be quite difficult and I'm already horribly backlogged. 
> However there's an alternative option, modifying the source and getting
> it upstreamed, as I think RHEL 3 is still maintained.
> 
> The attached patch (untested) should give a 3X boost for kmap_atomics,
> by folding the two accesses to set the pte into one, and by dropping the
> access that clears the pte.  Unfortunately it breaks the ABI, since
> external modules will inline the original kmap_atomic() which expects
> the pte to be cleared.
> 
> This can be worked around by allocating new fixmap slots for kmap_atomic
> with the new behavior, and keeping the old slots with the old behavior,
> but we should first see if the maintainers are open to performance
> optimizations targeting kvm.
> 
RHEL3 is in Maintenance mode (for an explanation see
http://www.redhat.com/security/updates/errata/), which means performance
enhancement patches will not make it in.

Also, I'm going to be out of the office for a couple of weeks in July,
so I will need to put this aside until mid-August or so. I'll reevaluate
options then.

david


* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-23 14:09                                                                           ` David S. Ahern
@ 2008-06-25  9:51                                                                             ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-06-25  9:51 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
>
> RHEL3 is in Maintenance mode (for an explanation see
> http://www.redhat.com/security/updates/errata/) which means performance
> enhancement patches will not make it in.
>
>   

Scratch that idea, then.

> Also, I'm going to be out of the office for a couple of weeks in July,
> so I will need to put this aside until mid-August or so. I'll reevaluate
> options then.
>   

One thing I'm looking at is implementing out-of-sync shadow pages like 
Xen does, which looks like it will obsolete the entire emulate-vs-flood 
thing at the cost of making unshadowing a little more expensive and 
consuming more memory.  See 
http://thread.gmane.org/gmane.comp.emulators.xen.devel/52557 (and 58, 
59, 60).

-- 
error compiling committee.c: too many arguments to function



end of thread, other threads:[~2008-06-25  9:51 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-16  0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
2008-04-16  8:46 ` Avi Kivity
2008-04-17 21:12   ` David S. Ahern
2008-04-18  7:57     ` Avi Kivity
2008-04-21  4:31       ` David S. Ahern
2008-04-21  9:19         ` Avi Kivity
2008-04-21 17:07           ` David S. Ahern
2008-04-22 20:23           ` David S. Ahern
2008-04-23  8:04             ` Avi Kivity
2008-04-23 15:23               ` David S. Ahern
2008-04-23 15:53                 ` Avi Kivity
2008-04-23 16:39                   ` David S. Ahern
2008-04-24 17:25                     ` David S. Ahern
2008-04-26  6:43                       ` Avi Kivity
2008-04-26  6:20                     ` Avi Kivity
2008-04-25 17:33                 ` David S. Ahern
2008-04-26  6:45                   ` Avi Kivity
2008-04-28 18:15                   ` Marcelo Tosatti
2008-04-28 23:45                     ` David S. Ahern
2008-04-30  4:18                       ` David S. Ahern
2008-04-30  9:55                         ` Avi Kivity
2008-04-30 13:39                           ` David S. Ahern
2008-04-30 13:49                             ` Avi Kivity
2008-05-11 12:32                               ` Avi Kivity
2008-05-11 13:36                                 ` Avi Kivity
2008-05-13  3:49                                   ` David S. Ahern
2008-05-13  7:25                                     ` Avi Kivity
2008-05-14 20:35                                       ` David S. Ahern
2008-05-15 10:53                                         ` Avi Kivity
2008-05-17  4:31                                           ` David S. Ahern
     [not found]                                             ` <482FCEE1.5040306@qumranet.com>
     [not found]                                               ` <4830F90A.1020809@cisco.com>
2008-05-19  4:14                                                 ` [kvm-devel] " David S. Ahern
2008-05-19 14:27                                                   ` Avi Kivity
2008-05-19 16:25                                                     ` David S. Ahern
2008-05-19 17:04                                                       ` Avi Kivity
2008-05-20 14:19                                                     ` Avi Kivity
2008-05-20 14:34                                                       ` Avi Kivity
2008-05-22 22:08                                                       ` David S. Ahern
2008-05-28 10:51                                                         ` Avi Kivity
2008-05-28 14:13                                                           ` David S. Ahern
2008-05-28 14:35                                                             ` Avi Kivity
2008-05-28 19:49                                                               ` David S. Ahern
2008-05-29  6:37                                                                 ` Avi Kivity
2008-05-28 14:48                                                             ` Andrea Arcangeli
2008-05-28 14:57                                                               ` Avi Kivity
2008-05-28 15:39                                                                 ` David S. Ahern
2008-05-29 11:49                                                                   ` Avi Kivity
2008-05-29 12:10                                                                   ` Avi Kivity
2008-05-29 13:49                                                                     ` David S. Ahern
2008-05-29 14:08                                                                       ` Avi Kivity
2008-05-28 15:58                                                                 ` Andrea Arcangeli
2008-05-28 15:37                                                               ` Avi Kivity
2008-05-28 15:43                                                                 ` David S. Ahern
2008-05-28 17:04                                                                   ` Andrea Arcangeli
2008-05-28 17:24                                                                     ` David S. Ahern
2008-05-29 10:01                                                                     ` Avi Kivity
2008-05-29 14:27                                                                       ` Andrea Arcangeli
2008-05-29 15:11                                                                         ` David S. Ahern
2008-05-29 15:16                                                                         ` Avi Kivity
2008-05-30 13:12                                                                           ` Andrea Arcangeli
2008-05-31  7:39                                                                             ` Avi Kivity
2008-05-29 16:42                                                           ` David S. Ahern
2008-05-31  8:16                                                             ` Avi Kivity
2008-06-02 16:42                                                               ` David S. Ahern
2008-06-05  8:37                                                                 ` Avi Kivity
2008-06-05 16:20                                                                   ` David S. Ahern
2008-06-06 16:40                                                                     ` Avi Kivity
2008-06-19  4:20                                                                       ` David S. Ahern
2008-06-22  6:34                                                                         ` Avi Kivity
2008-06-23 14:09                                                                           ` David S. Ahern
2008-06-25  9:51                                                                             ` Avi Kivity
2008-04-30 13:56                             ` Daniel P. Berrange
2008-04-30 14:23                               ` David S. Ahern
2008-04-23  8:03     ` Avi Kivity
