* performance with guests running 2.4 kernels (specifically RHEL3)
@ 2008-04-16 0:15 David S. Ahern
2008-04-16 8:46 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-16 0:15 UTC (permalink / raw)
To: kvm-devel
I have been looking at RHEL3 based guests lately, and to say the least the
performance is horrible. Rather than write a long tome on what I've done and
observed, I'd like to find out if anyone has some insights or known problem
areas running 2.4 guests. The short of it is that % system time spikes from time
to time (e.g., on exec of a new process such as running /bin/true).
I do not see the problem running RHEL3 on ESX, and an equivalent VM running
RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
that is not handled efficiently by kvm.
Can someone shed some light on it?
thanks,
david
-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-16 0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
@ 2008-04-16 8:46 ` Avi Kivity
2008-04-17 21:12 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-16 8:46 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> I have been looking at RHEL3 based guests lately, and to say the least the
> performance is horrible. Rather than write a long tome on what I've done and
> observed, I'd like to find out if anyone has some insights or known problem
> areas running 2.4 guests. The short of it is that % system time spikes from time
> to time (e.g., on exec of a new process such as running /bin/true).
>
> I do not see the problem running RHEL3 on ESX, and an equivalent VM running
> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
> that is not handled efficiently by kvm.
>
> Can someone shed some light on it?
>
It's not something that I test regularly. If you're running a 32-bit
kernel, I'd suspect kmap(), or perhaps false positives from the fork
detector.
kvmtrace will probably give enough info to tell exactly what's going on;
'kvm_stat -1' while the badness is happening may also help.
--
error compiling committee.c: too many arguments to function
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-16 8:46 ` Avi Kivity
@ 2008-04-17 21:12 ` David S. Ahern
2008-04-18 7:57 ` Avi Kivity
2008-04-23 8:03 ` Avi Kivity
0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-17 21:12 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
kvm_stat -1 is practically impossible to time correctly to get a good snippet.
kvmtrace is a fascinating tool. I captured trace data that encompassed one
intense period where the VM appeared to freeze (no terminal response for a few
seconds).
After converting to text I examined an arbitrary section in time (how do you
correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
vcpu1 hammered with page faults. (I put the representative data below; I can
send the binary or text files if you really want to see them.) All told, over
about a 10-12 second time period the trace text files contain 8426221 lines,
and 2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
really high).
david
---------------------------------
vcpu0 data:
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400096784 (+ 76248) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+ 1792) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400114528 (+ 15952) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+ 1800) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400137216 (+ 20888) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+ 1624) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400209344 (+ 70504) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400226312 (+ 15256) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+ 1728) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400248688 (+ 20648) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
vcpu1 data:
9968400002032 (+ 3808) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c016127f ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400005448 (+ 3416) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400009832 (+ 4384) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c016104a ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x0000000b, virt = 0x00000000 fffb6f88 ]
9968400071584 (+ 61752) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400075608 (+ 4024) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400083528 (+ 7920) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400087288 (+ 3760) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400097312 (+ 10024) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400103064 (+ 5752) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400116624 (+ 13560) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400120424 (+ 3800) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400123856 (+ 3432) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400128208 (+ 4352) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000009, virt = 0x00000000 fffb6d28 ]
9968400183848 (+ 55640) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400188232 (+ 4384) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160e4d ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400196160 (+ 7928) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400199928 (+ 3768) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160e54 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400209864 (+ 9936) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400214984 (+ 5120) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db4 ]
9968400228232 (+ 13248) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400232000 (+ 3768) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000003, virt = 0x00000000 c0009db0 ]
9968400235424 (+ 3424) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400239816 (+ 4392) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [
exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
0x00000009, virt = 0x00000000 fffb6d30 ]
Avi Kivity wrote:
> David S. Ahern wrote:
>> I have been looking at RHEL3 based guests lately, and to say the least
>> the
>> performance is horrible. Rather than write a long tome on what I've
>> done and
>> observed, I'd like to find out if anyone has some insights or known
>> problem
>> areas running 2.4 guests. The short of it is that % system time spikes
>> from time
>> to time (e.g., on exec of a new process such as running /bin/true).
>>
>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>> running
>> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something
>> in a way
>> that is not handled efficiently by kvm.
>>
>> Can someone shed some light on it?
>>
>
> It's not something that I test regularly. If you're running a 32-bit
> kernel, I'd suspect kmap(), or perhaps false positives from the fork
> detector.
>
> kvmtrace will probably give enough info to tell exactly what's going on;
> 'kvm_stat -1' while the badness is happening may also help.
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-17 21:12 ` David S. Ahern
@ 2008-04-18 7:57 ` Avi Kivity
2008-04-21 4:31 ` David S. Ahern
2008-04-23 8:03 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-18 7:57 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
> kvmtrace is a fascinating tool. I captured trace data that encompassed one
> intense period where the VM appeared to freeze (no terminal response for a few
> seconds).
>
> After converting to text I examined an arbitrary section in time (how do you
> correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
> vcpu1 hammered with page faults. (I put the representative data below; I can
> send the binary or text files if you really want to see them.) All told, over
> about a 10-12 second time period the trace text files contain 8426221 lines,
> and 2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
> really high).
>
> david
>
>
> vcpu1 data:
>
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000009, virt = 0x00000000 fffb6d28 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode =
> 0x00000009, virt = 0x00000000 fffb6d30 ]
>
>
>
The pattern here is c0009db4, c0009db0, fffb6xxx, c0009db0: setting a
pte at c0009db0, accessing the page mapped by the pte, unmapping the
pte. Note that c0009db0 (bits 3:11) == 0x1b6 == fffb6xxx (bits 12:20).
That's a kmap_atomic() + access + kunmap_atomic() sequence.
The expensive accesses (~50K cycles) seem to be the ones at fffb6xxx.
These shouldn't show up at all -- kvm_mmu_pte_write() ought to
have set up the ptes correctly.
Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
(is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace
at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase
the 3 to 4 in the line right above that, maybe the fork detector is
misfiring).
---------------------------------
vcpu0 data:
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400096784 (+ 76248) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+ 1792) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400114528 (+ 15952) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+ 1800) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400137216 (+ 20888) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+ 1624) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400209344 (+ 70504) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400226312 (+ 15256) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+ 1728) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400248688 (+ 20648) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [
exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
Those are probably IPIs due to the kmaps above.
>
> Avi Kivity wrote:
>
>> David S. Ahern wrote:
>>
>>> I have been looking at RHEL3 based guests lately, and to say the least
>>> the
>>> performance is horrible. Rather than write a long tome on what I've
>>> done and
>>> observed, I'd like to find out if anyone has some insights or known
>>> problem
>>> areas running 2.4 guests. The short of it is that % system time spikes
>>> from time
>>> to time (e.g., on exec of a new process such as running /bin/true).
>>>
>>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>>> running
>>> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something
>>> in a way
>>> that is not handled efficiently by kvm.
>>>
>>> Can someone shed some light on it?
>>>
>>>
>> It's not something that I test regularly. If you're running a 32-bit
>> kernel, I'd suspect kmap(), or perhaps false positives from the fork
>> detector.
>>
>> kvmtrace will probably give enough info to tell exactly what's going on;
>> 'kvm_stat -1' while the badness is happening may also help.
>>
>>
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-18 7:57 ` Avi Kivity
@ 2008-04-21 4:31 ` David S. Ahern
2008-04-21 9:19 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-21 4:31 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
I added the traces and captured data over another apparent lockup of the guest.
This seems to be representative of the sequence (pid/vcpu removed).
(+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3632) VMENTRY
(+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
(+ 54928) VMENTRY
(+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
(+8432) VMENTRY
(+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 13832) VMENTRY
(+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3712) VMENTRY
(+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
(+ 65216) VMENTRY
(+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
(+8640) VMENTRY
(+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 14160) VMENTRY
I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
vcpu1 files have 85000 total lines; compressed, the files total ~500k.
I did not see the FLOODED trace come out during this sample though I did bump
the count from 3 to 4 as you suggested.
Correlating rip addresses to the 2.4 kernel:
c0160d00-c0161290 = page_referenced
It looks like the event is kscand running through the pages. I suspected this
some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
appeared to lower the peak of the spikes, but maybe I imagined it. I believe
lowering that value makes kscand wake up more often but do less work (page
scanning) each time it is awakened.
david
Avi Kivity wrote:
> Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
> (is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace
> at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase
> the 3 to 4 in the line right above that, maybe the fork detector is
> misfiring).
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-21 4:31 ` David S. Ahern
@ 2008-04-21 9:19 ` Avi Kivity
2008-04-21 17:07 ` David S. Ahern
2008-04-22 20:23 ` David S. Ahern
0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-21 9:19 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> I added the traces and captured data over another apparent lockup of the guest.
> This seems to be representative of the sequence (pid/vcpu removed).
>
> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3632) VMENTRY
> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
> (+ 54928) VMENTRY
>
Can you oprofile the host to see where the 54K cycles are spent?
> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
> (+8432) VMENTRY
> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 13832) VMENTRY
>
>
> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3712) VMENTRY
> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>
This indeed has the accessed bit clear.
> (+ 65216) VMENTRY
> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>
This has the accessed bit set and the user bit clear, and the pte
pointing at the previous pte_write gpa. Looks like a kmap_atomic().
> (+8640) VMENTRY
> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 14160) VMENTRY
>
> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
> vcpu1 files have 85000 total lines; compressed, the files total ~500k.
>
> I did not see the FLOODED trace come out during this sample though I did bump
> the count from 3 to 4 as you suggested.
>
>
>
Bumping the count was supposed to remove the flooding...
> Correlating rip addresses to the 2.4 kernel:
>
> c0160d00-c0161290 = page_referenced
>
> It looks like the event is kscand running through the pages. I suspected this
> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
> lowering that value makes kscand wake up more often but do less work (page
> scanning) each time it is awakened.
>
>
What does 'top' in the guest show (perhaps sorted by total cpu time
rather than instantaneous usage)?
What host kernel are you running? How many host cpus?
--
error compiling committee.c: too many arguments to function
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-21 9:19 ` Avi Kivity
@ 2008-04-21 17:07 ` David S. Ahern
2008-04-22 20:23 ` David S. Ahern
1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-21 17:07 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
host:
2.6.25-rc8, x86_64, kvm-66
1 dual-core Xeon(R) CPU 3050 @ 2.13GHz
6 GB RAM
(This behavior also occurs on a larger server with 2 dual-core Xeon(R) CPU
5140 @ 2.33GHz, 4 GB RAM. Same kernel and kvm versions.)
guest:
RHEL3 U8 (2.4.21-47.ELsmp), 2 vcpus, 2 GB RAM
As usual, waited for a guest "event" -- high system time in guest which appears
to lock it up. Following the event, kscand was the top CPU user (cumulative
time) in the guest.
During the event, 2 qemu threads are pegging the host CPU at 100%. Top samples
from oprofile (oprofile was started after the freeze began and stopped when the
guest became responsive again):
samples % image name app name symbol name
171716 35.1350 kvm-intel.ko kvm_intel vmx_vcpu_run
45836 9.3786 vmlinux vmlinux copy_user_generic_string
39417 8.0652 kvm.ko kvm kvm_read_guest_atomic
23604 4.8296 vmlinux vmlinux add_preempt_count
22878 4.6811 vmlinux vmlinux __smp_call_function_mask
16143 3.3030 kvm.ko kvm gfn_to_hva
14648 2.9971 vmlinux vmlinux sub_preempt_count
14589 2.9851 kvm.ko kvm __gfn_to_memslot
11666 2.3870 kvm.ko kvm unalias_gfn
10834 2.2168 kvm.ko kvm kvm_mmu_zap_page
10532 2.1550 kvm.ko kvm paging64_prefetch_page
6285 1.2860 kvm-intel.ko kvm_intel handle_exception
6066 1.2412 kvm.ko kvm kvm_arch_vcpu_ioctl_run
4741 0.9701 kvm.ko kvm kvm_add_trace
3801 0.7777 vmlinux vmlinux __copy_from_user_inatomic
3592 0.7350 vmlinux vmlinux follow_page
3326 0.6805 kvm.ko kvm mmu_memory_cache_alloc
3317 0.6787 kvm-intel.ko kvm_intel kvm_handle_exit
2971 0.6079 kvm.ko kvm paging64_page_fault
2777 0.5682 kvm.ko kvm paging64_walk_addr
2294 0.4694 kvm.ko kvm kvm_mmu_pte_write
2278 0.4661 kvm.ko kvm kvm_flush_remote_tlbs
2266 0.4636 kvm-intel.ko kvm_intel vmcs_writel
2086 0.4268 kvm.ko kvm mmu_set_spte
2041 0.4176 kvm.ko kvm kvm_read_guest
1615 0.3304 vmlinux vmlinux free_hot_cold_page
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61c8 ]
>> (+ 54928) VMENTRY
>>
>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+ 13832) VMENTRY
>>
>>
>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>> 3d55d047 ]
>>
>
> This indeed has the accessed bit clear.
>
>> (+ 65216) VMENTRY
>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 3d598363 ]
>>
>
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa. Looks like a kmap_atomic().
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 +
>> corresponding
>> vcpu1 files have 85000 total lines; compressed, the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I
>> did bump
>> the count from 3 to 4 as you suggested.
>>
>>
>>
>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I
>> suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl
>> variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>> believe
>> lowering that value makes kscand wake up more often but do less work
>> (page
>> scanning) each time it is awakened.
>>
>>
>
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-21 9:19 ` Avi Kivity
2008-04-21 17:07 ` David S. Ahern
@ 2008-04-22 20:23 ` David S. Ahern
2008-04-23 8:04 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-22 20:23 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:
1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()
So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault() (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)
I captured trace data for ~5-seconds during one of the usual events (again this
time it was due to kscand in the guest). I ran the formatted trace data through
an awk script to summarize:
TSC cycles           tracer2   tracer3   tracer4
      0 -  10,000:    295067    213251    115873
 10,001 -  25,000:      7682      1004     98336
 25,001 -  50,000:       201        15        36
 50,001 - 100,000:    100655         0        10
       > 100,000:        117         0        15
This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.
mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
most of them relatively quickly.
Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
host processes could interrupt them.
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61c8 ]
>> (+ 54928) VMENTRY
>>
>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+ 13832) VMENTRY
>>
>>
>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>> fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>> 3d55d047 ]
>>
>
> This indeed has the accessed bit clear.
>
>> (+ 65216) VMENTRY
>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>> 3d598363 ]
>>
>
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa. Looks like a kmap_atomic().
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>> c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>> c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>> 00000000 ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 +
>> corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I
>> did bump
>> the count from 3 to 4 as you suggested.
>>
>>
>>
>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I
>> suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl
>> variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>> believe
>> lowering that value makes kscand wake up more often but do less work
>> (page
>> scanning) each time it is awakened.
>>
>>
>
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-17 21:12 ` David S. Ahern
2008-04-18 7:57 ` Avi Kivity
@ 2008-04-23 8:03 ` Avi Kivity
1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-23 8:03 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
>
I've added a --log option to get vmstat-like output. I've also added
--fields to select which fields are of interest, to avoid the need for
280-column displays. That's now pushed to kvm-userspace.git.
Example:
./kvm_stat -f 'mmu.*|pf.*|remote.*' -l
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-22 20:23 ` David S. Ahern
@ 2008-04-23 8:04 ` Avi Kivity
2008-04-23 15:23 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-23 8:04 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:
>
> 1. before vcpu->arch.mmu.page_fault()
> 2. after vcpu->arch.mmu.page_fault()
> 3. after mmu_topup_memory_caches()
> 4. after emulate_instruction()
>
> So the delta in the trace reports show:
> - cycles required for arch.mmu.page_fault (tracer 2)
> - cycles required for mmu_topup_memory_caches(tracer 3)
> - cycles required for emulate_instruction() (tracer 4)
>
> I captured trace data for ~5-seconds during one of the usual events (again this
> time it was due to kscand in the guest). I ran the formatted trace data through
> an awk script to summarize:
>
> TSC cycles tracer2 tracer3 tracer4
> 0 - 10,000: 295067 213251 115873
> 10,001 - 25,000: 7682 1004 98336
> 25,001 - 50,000: 201 15 36
> 50,001 - 100,000: 100655 0 10
> > 100,000: 117 0 15
>
> This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
> 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
> took longer than 50,000 cycles. The page_fault function getting run is
> paging64_page_fault.
>
>
This does look like the fork detector. Once in every four faults, it
triggers and the fault becomes slow. 100K floods == 100K page tables ==
200GB of virtual memory, which seems excessive.
Is this running a forked load like apache, with many processes? How
much memory is on the guest, and is there any memory pressure?
> mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
> most of them relatively quickly.
> Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
> host processes could interrupt it.
>
> david
>
>
> Avi Kivity wrote:
>
>> David S. Ahern wrote:
>>
>>> I added the traces and captured data over another apparent lockup of
>>> the guest.
>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>
>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3632) VMENTRY
>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61c8 ]
>>> (+ 54928) VMENTRY
>>>
>>>
>> Can you oprofile the host to see where the 54K cycles are spent?
>>
>>
>>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610e7 ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>>> 41c5d363 ]
>>> (+8432) VMENTRY
>>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610ee ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db0 ]
>>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>>> 00000000 ]
>>> (+ 13832) VMENTRY
>>>
>>>
>>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3712) VMENTRY
>>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61d0 ]
>>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000
>>> 3d55d047 ]
>>>
>>>
>> This indeed has the accessed bit clear.
>>
>>
>>> (+ 65216) VMENTRY
>>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610e7 ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000
>>> 3d598363 ]
>>>
>>>
>> This has the accessed bit set and the user bit clear, and the pte
>> pointing at the previous pte_write gpa. Looks like a kmap_atomic().
>>
>>
>>> (+8640) VMENTRY
>>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c01610ee ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db0 ]
>>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000
>>> 00000000 ]
>>> (+ 14160) VMENTRY
>>>
>>> I can forward a more complete time snippet if you'd like. vcpu0 +
>>> corresponding
>>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>>
>>> I did not see the FLOODED trace come out during this sample though I
>>> did bump
>>> the count from 3 to 4 as you suggested.
>>>
>>>
>>>
>>>
>> Bumping the count was supposed to remove the flooding...
>>
>>
>>> Correlating rip addresses to the 2.4 kernel:
>>>
>>> c0160d00-c0161290 = page_referenced
>>>
>>> It looks like the event is kscand running through the pages. I
>>> suspected this
>>> some time ago, and tried tweaking the kscand_work_percent sysctl
>>> variable. It
>>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>>> believe
>>> lowering that value makes kscand wake up more often but do less work
>>> (page
>>> scanning) each time it is awakened.
>>>
>>>
>>>
>> What does 'top' in the guest show (perhaps sorted by total cpu time
>> rather than instantaneous usage)?
>>
>> What host kernel are you running? How many host cpus?
>>
>>
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 8:04 ` Avi Kivity
@ 2008-04-23 15:23 ` David S. Ahern
2008-04-23 15:53 ` Avi Kivity
2008-04-25 17:33 ` David S. Ahern
0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-23 15:23 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
>> Avi Kivity wrote:
>>
>>> David S. Ahern wrote:
>>>
>>>> I added the traces and captured data over another apparent lockup of
>>>> the guest.
>>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>>
>>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016127c ]
>>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>>> c0009db4 ]
>>>> (+3632) VMENTRY
>>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016104a ]
>>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>>>> fffb61c8 ]
>>>> (+ 54928) VMENTRY
>>>>
>>> Can you oprofile the host to see where the 54K cycles are spent?
>>>
>>>
I've continued drilling down with the tracers to answer that question. I have
done runs with tracers in paging64_page_fault() and they showed the overhead is
in the fetch() function. On my last run the tracers are in paging64_fetch() as follows:
1. after is_present_pte() check
2. before kvm_mmu_get_page()
3. after kvm_mmu_get_page()
4. after if (!metaphysical) {}
The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta
between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run.
Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new
tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and
access variables; tracer5 dumps value in shadow_ent.
A representative trace sample is:
(+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+ 2664) PAGE_FAULT1 [ write_count = 0 ]
(+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 50416) PAGE_FAULT3
(+ 472) PAGE_FAULT4
(+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ]
(+ 1528) VMENTRY
(+ 4992) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2296) PAGE_FAULT1 [ write_count = 0 ]
(+ 816) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
(+ 6424) VMENTRY
(+ 3864) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 2496) PAGE_FAULT1 [ write_count = 1 ]
(+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 10248) VMENTRY
(+ 4744) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2408) PAGE_FAULT1 [ write_count = 2 ]
(+ 760) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ]
(+ 1240) VMENTRY
(+ 4624) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+ 2512) PAGE_FAULT1 [ write_count = 0 ]
(+ 496) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 48664) PAGE_FAULT3
(+ 472) PAGE_FAULT4
(+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ]
(+ 1576) VMENTRY
So basically every 4th trip through the fetch function it runs
kvm_mmu_get_page(). A summary of the entire trace file shows this function
rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count
is always 0 when the high cycles are hit.
More tidbits:
- The hugepage option seems to have no effect -- the system spikes and overhead
occurs with and without the hugepage option (above data is with it).
- As the guest runs for hours, the intensity of the spikes drops, though they
still occur regularly, and kscand continues to be the primary suspect. qemu's RSS
tends toward the guest's memory allotment of 2GB. Internally, guest memory usage runs
at ~1GB page cache, 57M buffers, 24M swap, ~800MB for processes.
- I have looked at process creation and do not see a strong correlation between
system time spikes and the number of new processes. So far the only correlations
seem to be kscand and the amount of memory used, i.e., stock RHEL3 with few processes
shows tiny spikes, whereas in my tests with 90+ processes using about 800M plus a
continually updating page cache (i.e., moderate IO levels) the spikes are strong
and last for seconds.
- Time runs really fast in the guest, gaining several minutes in 24 hours.
I'll download your kvm_stat update and give it a try. When I started this
investigation I was using Christian's kvmstat script which dumped stats to a
file. Plots of that data did not show a strong correlation with guest system time.
david
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 15:23 ` David S. Ahern
@ 2008-04-23 15:53 ` Avi Kivity
2008-04-23 16:39 ` David S. Ahern
2008-04-25 17:33 ` David S. Ahern
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-23 15:53 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> I've continued drilling down with the tracers to answer that question. I have
> done runs with tracers in paging64_page_fault and it showed the overhead is with
> the fetch() function. On my last run the tracers are in paging64_fetch() as follows:
>
> 1. after is_present_pte() check
> 2. before kvm_mmu_get_page()
> 3. after kvm_mmu_get_page()
> 4. after if (!metaphysical) {}
>
> The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta
> between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run.
> Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new
> tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and
> access variables; tracer5 dumps value in shadow_ent.
>
> A representative trace sample is:
>
> (+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+ 2664) PAGE_FAULT1 [ write_count = 0 ]
> (+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 50416) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ]
> (+ 1528) VMENTRY
> (+ 4992) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2296) PAGE_FAULT1 [ write_count = 0 ]
> (+ 816) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
> (+ 6424) VMENTRY
> (+ 3864) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 2496) PAGE_FAULT1 [ write_count = 1 ]
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 10248) VMENTRY
> (+ 4744) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2408) PAGE_FAULT1 [ write_count = 2 ]
> (+ 760) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ]
> (+ 1240) VMENTRY
> (+ 4624) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+ 2512) PAGE_FAULT1 [ write_count = 0 ]
> (+ 496) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 48664) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ]
> (+ 1576) VMENTRY
>
> So basically every 4th trip through the fetch function it runs
> kvm_mmu_get_page(). A summary of the entire trace file shows this function
> rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count
> is always 0 when the high cycles are hit.
>
>
Ah! The flood detector is not seeing the access through the
kmap_atomic() pte, because that access has gone through the emulator.
last_updated_pte_accessed(vcpu) will never return true.
Can you verify that last_updated_pte_accessed(vcpu) indeed always
returns false?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 15:53 ` Avi Kivity
@ 2008-04-23 16:39 ` David S. Ahern
2008-04-24 17:25 ` David S. Ahern
2008-04-26 6:20 ` Avi Kivity
0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-23 16:39 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi Kivity wrote:
>
> Ah! The flood detector is not seeing the access through the
> kmap_atomic() pte, because that access has gone through the emulator.
> last_updated_pte_accessed(vcpu) will never return true.
>
> Can you verify that last_updated_pte_accessed(vcpu) indeed always
> returns false?
>
It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
the return code of last_updated_pte_accessed(vcpu), i.e.,
pte_access = last_updated_pte_accessed(vcpu);
KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
A sample:
(+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+ 2480) PAGE_FAULT1 [ write_count = 0 ]
(+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 51672) PAGE_FAULT3
(+ 472) PAGE_FAULT4
(+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ]
(+ 1496) VMENTRY
(+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2352) PAGE_FAULT1 [ write_count = 0 ]
(+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
(+ 0) PTE_ACCESS [ pte_access = 1 ]
(+ 6864) VMENTRY
(+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 2376) PAGE_FAULT1 [ write_count = 1 ]
(+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 0) PTE_ACCESS [ pte_access = 0 ]
(+ 12344) VMENTRY
(+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2416) PAGE_FAULT1 [ write_count = 2 ]
(+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ]
(+ 1128) VMENTRY
(+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+ 2448) PAGE_FAULT1 [ write_count = 0 ]
(+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 51520) PAGE_FAULT3
(+ 432) PAGE_FAULT4
(+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ]
(+ 1480) VMENTRY
david
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 16:39 ` David S. Ahern
@ 2008-04-24 17:25 ` David S. Ahern
2008-04-26 6:43 ` Avi Kivity
2008-04-26 6:20 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-24 17:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
current instruction pointer for the guest?
I take it the virt in the PAGE_FAULT trace output is the virtual address the
guest was referencing when the page fault occurred. What I don't understand (one
of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
ideas?
Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
trace data). What does the 4th bit in 0xb mean? bit 0 set means
PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?
david
David S. Ahern wrote:
>
> Avi Kivity wrote:
>> Ah! The flood detector is not seeing the access through the
>> kmap_atomic() pte, because that access has gone through the emulator.
>> last_updated_pte_accessed(vcpu) will never return true.
>>
>> Can you verify that last_updated_pte_accessed(vcpu) indeed always
>> returns false?
>>
>
> It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
> the rc of last_updated_pte_accessed(vcpu). ie.,
> pte_access = last_updated_pte_accessed(vcpu);
> KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
>
> A sample:
>
> (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2480) PAGE_FAULT1 [ write_count = 0 ]
> (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51672) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ]
> (+ 1496) VMENTRY
> (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2352) PAGE_FAULT1 [ write_count = 0 ]
> (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
> (+ 0) PTE_ACCESS [ pte_access = 1 ]
> (+ 6864) VMENTRY
> (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 2376) PAGE_FAULT1 [ write_count = 1 ]
> (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 0) PTE_ACCESS [ pte_access = 0 ]
> (+ 12344) VMENTRY
> (+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2416) PAGE_FAULT1 [ write_count = 2 ]
> (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ]
> (+ 1128) VMENTRY
> (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2448) PAGE_FAULT1 [ write_count = 0 ]
> (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51520) PAGE_FAULT3
> (+ 432) PAGE_FAULT4
> (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ]
> (+ 1480) VMENTRY
>
>
> david
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 15:23 ` David S. Ahern
2008-04-23 15:53 ` Avi Kivity
@ 2008-04-25 17:33 ` David S. Ahern
2008-04-26 6:45 ` Avi Kivity
2008-04-28 18:15 ` Marcelo Tosatti
1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-25 17:33 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
David S. Ahern wrote:
> Avi Kivity wrote:
>
>> David S. Ahern wrote:
>>
>>> I added the traces and captured data over another apparent lockup of
>>> the guest.
>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>
>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016127c ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>> c0009db4 ]
>>> (+3632) VMENTRY
>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>> c016104a ]
>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>>> fffb61c8 ]
>>> (+ 54928) VMENTRY
>>>
>> Can you oprofile the host to see where the 54K cycles are spent?
>>
Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
    for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
        gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
        pte_gpa += (i+offset) * sizeof(pt_element_t);

        r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
                                  sizeof(pt_element_t));
        if (r || is_present_pte(pt))
            sp->spt[i] = shadow_trap_nonpresent_pte;
        else
            sp->spt[i] = shadow_notrap_nonpresent_pte;
    }
This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
loop.
This function gets run >20,000/sec during some of the kscand loops.
david
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-23 16:39 ` David S. Ahern
2008-04-24 17:25 ` David S. Ahern
@ 2008-04-26 6:20 ` Avi Kivity
1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26 6:20 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> Avi Kivity wrote:
>
>> Ah! The flood detector is not seeing the access through the
>> kmap_atomic() pte, because that access has gone through the emulator.
>> last_updated_pte_accessed(vcpu) will never return true.
>>
>> Can you verify that last_updated_pte_accessed(vcpu) indeed always
>> returns false?
>>
>>
>
> It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
> the rc of last_updated_pte_accessed(vcpu). ie.,
> pte_access = last_updated_pte_accessed(vcpu);
> KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
>
> A sample:
>
> (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2480) PAGE_FAULT1 [ write_count = 0 ]
> (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51672) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ]
> (+ 1496) VMENTRY
> (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2352) PAGE_FAULT1 [ write_count = 0 ]
> (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
> (+ 0) PTE_ACCESS [ pte_access = 1 ]
> (+ 6864) VMENTRY
> (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 2376) PAGE_FAULT1 [ write_count = 1 ]
> (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 0) PTE_ACCESS [ pte_access = 0 ]
> (+ 12344) VMENTRY
> (+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2416) PAGE_FAULT1 [ write_count = 2 ]
> (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ]
> (+ 1128) VMENTRY
> (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2448) PAGE_FAULT1 [ write_count = 0 ]
> (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51520) PAGE_FAULT3
> (+ 432) PAGE_FAULT4
> (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ]
> (+ 1480) VMENTRY
>
>
Strange... there should be at least two pte_access = 0 traces in there
before flooding can occur, according to my reading of the code. The
counter needs to go up to 3 somehow.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-24 17:25 ` David S. Ahern
@ 2008-04-26 6:43 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26 6:43 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
> current instruction pointer for the guest?
>
>
Yes.
> I take it the virt in the PAGE_FAULT trace output is the virtual address the
> guest was referencing when the page fault occurred. What I don't understand (one
> of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
> ideas?
>
>
I'm pretty sure it is the kmap_atomic() pte. The guest wants to update
a pte (call it pte1), which is in HIGHMEM, so it doesn't have a
permanent mapping for it. It calls kmap_atomic() which sets up another
pte (pte2, two writes), and then accesses pte1 through pte2.
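As a pseudocode sketch of that sequence (illustrative only, not the actual RHEL3 source; the helper names, `idx`, and the KM_PTE0 slot are assumptions):

```c
/* pte1 lives in a highmem page-table page, so the guest kernel has no
 * permanent mapping for it. */
pte_t *pte2;

pte2 = kmap_atomic(pte1_page, KM_PTE0); /* setting up and later tearing down
                                           the fixmap pte (pte2) accounts for
                                           the PTE_WRITEs to gpa 0x9dbx */
set_pte(pte2 + idx, newval);            /* the access to pte1 through pte2
                                           accounts for the faults at
                                           virt 0xfffb6xxx */
kunmap_atomic(pte2, KM_PTE0);
```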
> Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
> trace data). What does the 4th bit in 0xb mean? bit 0 set means
> PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?
>
Bit 3 is the reserved bit, which means the shadow pte has an illegal bit
combination. kvm sets up vmx to forward non-present page faults (bit 0
clear) directly to the guest, so it needs some other pattern to get a
trapping fault.
IOW, there are two types of non-present shadow ptes in kvm: trapping
ones (where we don't know what the guest pte looks like) and nontrapping
ones (where we know the guest pte is not present, so we forward the
fault directly to the guest). The first type is encoded with the
reserved bit and present bit set, the second with both of them clear.
You can disable this trickery using the bypass_guest_pf module
parameter. It would be useful to try it; we'll see the forwarded
faults as well.
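The two sentinel encodings can be sketched as follows. The exact bit layout here is illustrative (kvm uses a real reserved physical-address bit); what matters is the property the sketch encodes: trapping sentinels keep the present bit set plus a reserved bit, nontrapping ones keep the present bit clear.

```c
#include <assert.h>

/* Sketch of the two non-present shadow-pte sentinels described above
 * (bit positions illustrative, not kvm's actual values). */
#define PT_PRESENT_MASK (1ull << 0)
#define RSVD_BIT_MASK   (1ull << 51)    /* an illustrative reserved bit */

#define SHADOW_TRAP_NONPRESENT   (PT_PRESENT_MASK | RSVD_BIT_MASK)
#define SHADOW_NOTRAP_NONPRESENT 0ull

/* With bypass_guest_pf, the cpu reflects faults whose error code has the
 * present bit clear straight into the guest; a present+reserved fault
 * (error code bit 3 set -- the 0xb seen in the traces) exits to kvm. */
static int fault_exits_to_kvm(unsigned long long spte)
{
    return (spte & PT_PRESENT_MASK) != 0;
}
```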
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-25 17:33 ` David S. Ahern
@ 2008-04-26 6:45 ` Avi Kivity
2008-04-28 18:15 ` Marcelo Tosatti
1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-26 6:45 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> David S. Ahern wrote:
>
>> Avi Kivity wrote:
>>
>>
>>> David S. Ahern wrote:
>>>
>>>
>>>> I added the traces and captured data over another apparent lockup of
>>>> the guest.
>>>> This seems to be representative of the sequence (pid/vcpu removed).
>>>>
>>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016127c ]
>>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000
>>>> c0009db4 ]
>>>> (+3632) VMENTRY
>>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000
>>>> c016104a ]
>>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000
>>>> fffb61c8 ]
>>>> (+ 54928) VMENTRY
>>>>
>>>>
>>> Can you oprofile the host to see where the 54K cycles are spent?
>>>
>>>
>
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>
>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>
>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                   sizeof(pt_element_t));
>         if (r || is_present_pte(pt))
>             sp->spt[i] = shadow_trap_nonpresent_pte;
>         else
>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>     }
>
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
> loop.
>
> This function gets run >20,000/sec during some of the kscand loops.
>
>
We really ought to optimize it. That's second order, however. The real
fix is making sure it isn't called so often.
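One way the per-entry cost could be cut — a sketch of an idea, not necessarily the fix being alluded to — is to copy the whole guest page-table page once and classify entries from the local copy, instead of issuing 512 separate `kvm_read_guest_atomic()` calls. The sentinel values and single-copy API below are placeholders; only the present/non-present classification mirrors the quoted loop (with one copy, a read failure is handled once for the whole page rather than per entry).

```c
#include <assert.h>

/* Sketch: classify all 512 entries from one local copy of the guest
 * page-table page.  Sentinels are placeholders, not kvm's values. */
#define PT64_ENT_PER_PAGE 512
#define PT_PRESENT        (1ull << 0)
#define SHADOW_TRAP       0x3ull  /* placeholder trapping sentinel    */
#define SHADOW_NOTRAP     0x0ull  /* placeholder nontrapping sentinel */

static void prefetch_page_from_copy(const unsigned long long *guest_pt,
                                    unsigned long long *spt)
{
    for (int i = 0; i < PT64_ENT_PER_PAGE; ++i)
        /* present guest ptes must trap; known-not-present ones can be
         * reflected straight to the guest, as in the quoted loop */
        spt[i] = (guest_pt[i] & PT_PRESENT) ? SHADOW_TRAP : SHADOW_NOTRAP;
}
```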
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-25 17:33 ` David S. Ahern
2008-04-26 6:45 ` Avi Kivity
@ 2008-04-28 18:15 ` Marcelo Tosatti
2008-04-28 23:45 ` David S. Ahern
1 sibling, 1 reply; 73+ messages in thread
From: Marcelo Tosatti @ 2008-04-28 18:15 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Avi Kivity
On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>
>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>
>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                   sizeof(pt_element_t));
>         if (r || is_present_pte(pt))
>             sp->spt[i] = shadow_trap_nonpresent_pte;
>         else
>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>     }
>
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
> loop.
>
> This function gets run >20,000/sec during some of the kscand loops.
Hi David,
Do you see the mmu_recycled counter increase?
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-28 18:15 ` Marcelo Tosatti
@ 2008-04-28 23:45 ` David S. Ahern
2008-04-30 4:18 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-28 23:45 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm-devel, Avi Kivity
Hi Marcelo:
mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime.
Here is a kvm_stat sample where guest time was very high and qemu had 2
processors at 100% on the host. I removed counters where both columns have 0
value for brevity.
exits 45937979 758051
fpu_reload 1416831 87
halt_exits 112911 0
halt_wakeup 31771 0
host_state_reload 2068602 263
insn_emulation 21601480 365493
io_exits 1827374 2705
irq_exits 8934818 285196
mmio_exits 421674 147
mmu_cache_miss 4817689 93680
mmu_flooded 4815273 93680
mmu_pde_zapped 51344 0
mmu_prefetch 4817625 93680
mmu_pte_updated 14803298 270104
mmu_pte_write 19859863 363785
mmu_shadow_zapped 4832106 93679
pf_fixed 32184355 468398
pf_guest 264138 0
remote_tlb_flush 10697762 280522
tlb_flush 10301338 176424
(NOTE: This is for a *5* second sample interval instead of 1 to allow me to
capture the data).
Here's a sample when the guest is "well-behaved" (system time <10%, though):
exits 51502194 97453
fpu_reload 1421736 227
halt_exits 138361 1927
halt_wakeup 33047 117
host_state_reload 2110190 3740
insn_emulation 24367441 47260
io_exits 1874075 2576
irq_exits 10224702 13333
mmio_exits 435154 1726
mmu_cache_miss 5414097 11258
mmu_flooded 5411548 11243
mmu_pde_zapped 52851 44
mmu_prefetch 5414031 11258
mmu_pte_updated 16854686 29901
mmu_pte_write 22526765 42285
mmu_shadow_zapped 5430025 11313
pf_fixed 36144578 67666
pf_guest 282794 430
remote_tlb_flush 12126268 14619
tlb_flush 11753162 21460
There is definitely a strong correlation between the mmu counters and high
system times in the guest. I am still trying to find out what in the guest is
stimulating it when running on RHEL3; I do not see this same behavior for an
equivalent setup running on RHEL4.
By the way I added an mmu_prefetch stat in prefetch_page() to count the number
of times the for() loop is hit with PTTYPE == 64; ie., number of times
paging64_prefetch_page() is invoked. (I wanted an explicit counter for this
loop, though the info seems to duplicate other entries.) That counter is listed
above. As I mentioned in a prior post when kscand kicks in the change in
mmu_prefetch counter is at 20,000+/sec, with each trip through that function
taking 45k+ cycles.
kscand is an instigator shortly after boot; however, kscand is *not* the culprit
once the system has been up for 30-45 minutes. I have started instrumenting the
RHEL3U8 kernel and for the load I am running kscand does not walk the active
lists very often once the system is up.
So, to dig deeper into what in the guest is stimulating the mmu, I collected
kvmtrace data for about a 2 minute time interval which caught about a 30-second
period where guest system time was steady in the 25-30% range. Summarizing the
number of times a RIP appears in a VMEXIT shows the following high runners:
count RIP RHEL3-symbol
82549 0xc0140e42 follow_page [kernel] c0140d90 offset b2
42532 0xc0144760 handle_mm_fault [kernel] c01446d0 offset 90
36826 0xc013da4a futex_wait [kernel] c013d870 offset 1da
29987 0xc0145cd0 zap_pte_range [kernel] c0145c10 offset c0
27451 0xc0144018 do_no_page [kernel] c0143e20 offset 1f8
(halt entry removed from the list since that is the ideal scenario for an exit).
So the RIP correlates to follow_page() for a large percentage of the VMEXITs.
I wrote an awk script to summarize (histogram style) the TSC cycles between
VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, 82,271 times
(ie., almost 100% of the time) the trace shows a delta between 50k and 100k
cycles between the VMEXIT and the subsequent VMENTRY. Similarly for the second
one, 0xc0144760, 42403 times (again almost 100% of the occurrences) the trace
shows a delta between 50k and 100k cycles between VMEXIT and VMENTRY. These
seem to correlate with the prefetch_page function in kvm, though I am not 100%
positive on that.
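The two summaries described here — VMEXIT counts per RIP, and a histogram of exit-to-entry TSC deltas — can be sketched as below. The record layout is invented for illustration (kvmtrace's real format differs); only the method matches the description.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the trace post-processing described above: count VMEXITs at
 * a given rip and bucket the TSC gap to the following VMENTRY.  Record
 * layout is invented; the real data came from kvmtrace. */
struct trace_rec {
    int is_exit;                 /* 1 = VMEXIT, 0 = VMENTRY            */
    unsigned long rip;           /* guest rip (meaningful for exits)   */
    unsigned long long tsc;      /* timestamp of the record            */
};

struct rip_summary {
    int exits;                   /* VMEXITs at this rip                */
    int slow;                    /* gaps in the 50k..100k cycle range  */
};

static void summarize_rip(const struct trace_rec *r, size_t n,
                          unsigned long rip, struct rip_summary *s)
{
    s->exits = s->slow = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        if (!r[i].is_exit || r[i].rip != rip || r[i + 1].is_exit)
            continue;
        ++s->exits;
        unsigned long long delta = r[i + 1].tsc - r[i].tsc;
        if (delta >= 50000 && delta < 100000)
            ++s->slow;
    }
}
```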
I am now investigating the kernel paths leading to those functions. Any insights
would definitely be appreciated.
david
Marcelo Tosatti wrote:
> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>>
>>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>>
>>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>                                   sizeof(pt_element_t));
>>         if (r || is_present_pte(pt))
>>             sp->spt[i] = shadow_trap_nonpresent_pte;
>>         else
>>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>>     }
>>
>> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
>> loop.
>>
>> This function gets run >20,000/sec during some of the kscand loops.
>
> Hi David,
>
> Do you see the mmu_recycled counter increase?
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-28 23:45 ` David S. Ahern
@ 2008-04-30 4:18 ` David S. Ahern
2008-04-30 9:55 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 4:18 UTC (permalink / raw)
To: Marcelo Tosatti, Avi Kivity; +Cc: kvm-devel
Another tidbit for you guys as I make my way through various permutations:
I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
System time still has some regular hiccups that are higher than xen and esx
(e.g., 1 minute samples out of 5 show system time between 10 and 15%), but
overall guest behavior is good with the hugemem kernel.
One side effect I've noticed is that I cannot restart the RHEL3 guest running
the hugemem kernel in successive attempts. The guest has 2 vcpus and qemu shows
one thread at 100% cpu. If I recall correctly kvm_stat shows a large amount of
tlb_flushes (like millions in a 5-second sample). The scenario is:
1. start guest running hugemem kernel,
2. shutdown,
3. restart guest.
During step 3 it hangs, but at random points. Removing kvm/kvm-intel has no
effect; the guest still hangs on the restart. Rebooting the host clears the problem.
Alternatively, during the hang on a restart I can kill the guest, and then on
restart choose the normal, 32-bit smp kernel and the guest boots just fine. At
this point I can shutdown the guest and restart with the hugemem kernel and it
boots just fine.
david
David S. Ahern wrote:
> Hi Marcelo:
>
> mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime.
>
> Here is a kvm_stat sample where guest time was very high and qemu had 2
> processors at 100% on the host. I removed counters where both columns have 0
> value for brevity.
>
> exits 45937979 758051
> fpu_reload 1416831 87
> halt_exits 112911 0
> halt_wakeup 31771 0
> host_state_reload 2068602 263
> insn_emulation 21601480 365493
> io_exits 1827374 2705
> irq_exits 8934818 285196
> mmio_exits 421674 147
> mmu_cache_miss 4817689 93680
> mmu_flooded 4815273 93680
> mmu_pde_zapped 51344 0
> mmu_prefetch 4817625 93680
> mmu_pte_updated 14803298 270104
> mmu_pte_write 19859863 363785
> mmu_shadow_zapped 4832106 93679
> pf_fixed 32184355 468398
> pf_guest 264138 0
> remote_tlb_flush 10697762 280522
> tlb_flush 10301338 176424
>
> (NOTE: This is for a *5* second sample interval instead of 1 to allow me to
> capture the data).
>
> Here's a sample when the guest is "well-behaved" (system time <10%, though):
> exits 51502194 97453
> fpu_reload 1421736 227
> halt_exits 138361 1927
> halt_wakeup 33047 117
> host_state_reload 2110190 3740
> insn_emulation 24367441 47260
> io_exits 1874075 2576
> irq_exits 10224702 13333
> mmio_exits 435154 1726
> mmu_cache_miss 5414097 11258
> mmu_flooded 5411548 11243
> mmu_pde_zapped 52851 44
> mmu_prefetch 5414031 11258
> mmu_pte_updated 16854686 29901
> mmu_pte_write 22526765 42285
> mmu_shadow_zapped 5430025 11313
> pf_fixed 36144578 67666
> pf_guest 282794 430
> remote_tlb_flush 12126268 14619
> tlb_flush 11753162 21460
>
>
> There is definitely a strong correlation between the mmu counters and high
> system times in the guest. I am still trying to find out what in the guest is
> stimulating it when running on RHEL3; I do not see this same behavior for an
> equivalent setup running on RHEL4.
>
> By the way I added an mmu_prefetch stat in prefetch_page() to count the number
> of times the for() loop is hit with PTTYPE == 64; ie., number of times
> paging64_prefetch_page() is invoked. (I wanted an explicit counter for this
> loop, though the info seems to duplicate other entries.) That counter is listed
> above. As I mentioned in a prior post when kscand kicks in the change in
> mmu_prefetch counter is at 20,000+/sec, with each trip through that function
> taking 45k+ cycles.
>
> kscand is an instigator shortly after boot; however, kscand is *not* the culprit
> once the system has been up for 30-45 minutes. I have started instrumenting the
> RHEL3U8 kernel and for the load I am running kscand does not walk the active
> lists very often once the system is up.
>
> So, to dig deeper into what in the guest is stimulating the mmu, I collected
> kvmtrace data for about a 2 minute time interval which caught about a 30-second
> period where guest system time was steady in the 25-30% range. Summarizing the
> number of times a RIP appears in a VMEXIT shows the following high runners:
>
> count RIP RHEL3-symbol
> 82549 0xc0140e42 follow_page [kernel] c0140d90 offset b2
> 42532 0xc0144760 handle_mm_fault [kernel] c01446d0 offset 90
> 36826 0xc013da4a futex_wait [kernel] c013d870 offset 1da
> 29987 0xc0145cd0 zap_pte_range [kernel] c0145c10 offset c0
> 27451 0xc0144018 do_no_page [kernel] c0143e20 offset 1f8
>
> (halt entry removed from the list since that is the ideal scenario for an exit).
>
> So the RIP correlates to follow_page() for a large percentage of the VMEXITs.
>
> I wrote an awk script to summarize (histogram style) the TSC cycles between
> VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, 82,271 times
> (ie., almost 100% of the time) the trace shows a delta between 50k and 100k
> cycles between the VMEXIT and the subsequent VMENTRY. Similarly for the second
> one, 0xc0144760, 42403 times (again almost 100% of the occurrences) the trace
> shows a delta between 50k and 100k cycles between VMEXIT and VMENTRY. These
> seem to correlate with the prefetch_page function in kvm, though I am not 100%
> positive on that.
>
> I am now investigating the kernel paths leading to those functions. Any insights
> would definitely be appreciated.
>
> david
>
>
> Marcelo Tosatti wrote:
>> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>>> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>>>
>>>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>>>
>>>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>>                                   sizeof(pt_element_t));
>>>         if (r || is_present_pte(pt))
>>>             sp->spt[i] = shadow_trap_nonpresent_pte;
>>>         else
>>>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>>>     }
>>>
>>> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
>>> loop.
>>>
>>> This function gets run >20,000/sec during some of the kscand loops.
>> Hi David,
>>
>> Do you see the mmu_recycled counter increase?
>>
>
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 4:18 ` David S. Ahern
@ 2008-04-30 9:55 ` Avi Kivity
2008-04-30 13:39 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-30 9:55 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti
David S. Ahern wrote:
> Another tidbit for you guys as I make my way through various permutations:
> I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
> System time still has some regular hiccups that are higher than xen and esx
> (e.g., 1 minute samples out of 5 show system time between 10 and 15%), but
> overall guest behavior is good with the hugemem kernel.
>
>
Wait, the amount of info here is overwhelming. Let's stick with the
current kernel (32-bit, HIGHMEM4G, right?)
Did you get any traces with bypass_guest_pf=0? That may show more info.
--
Any sufficiently difficult bug is indistinguishable from a feature.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 9:55 ` Avi Kivity
@ 2008-04-30 13:39 ` David S. Ahern
2008-04-30 13:49 ` Avi Kivity
2008-04-30 13:56 ` Daniel P. Berrange
0 siblings, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 13:39 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti
Avi Kivity wrote:
> David S. Ahern wrote:
>> Another tidbit for you guys as I make my way through various
>> permutations:
>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>> better.
>> System time still has some regular hiccups that are higher than xen
>> and esx
>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>> but
>> overall guest behavior is good with the hugemem kernel.
>>
>>
>
> Wait, the amount of info here is overwhelming. Let's stick with the
> current kernel (32-bit, HIGHMEM4G, right?)
>
> Did you get any traces with bypass_guest_pf=0? That may show more info.
>
My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
My point in the last email was that the hugemem kernel shows a remarkable
difference (it uses 3 levels of page tables, right?). I was hoping that would
ring a bell with someone.
Adding bypass_guest_pf=0 did not improve the situation. Did you want anything
particular with that setting -- like a RIP summary or a summary of exit-entry
cycles?
david
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 13:39 ` David S. Ahern
@ 2008-04-30 13:49 ` Avi Kivity
2008-05-11 12:32 ` Avi Kivity
2008-04-30 13:56 ` Daniel P. Berrange
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-04-30 13:49 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti
David S. Ahern wrote:
> Avi Kivity wrote:
>
>> David S. Ahern wrote:
>>
>>> Another tidbit for you guys as I make my way through various
>>> permutations:
>>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>>> better.
>>> System time still has some regular hiccups that are higher than xen
>>> and esx
>>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>>> but
>>> overall guest behavior is good with the hugemem kernel.
>>>
>>>
>>>
>> Wait, the amount of info here is overwhelming. Let's stick with the
>> current kernel (32-bit, HIGHMEM4G, right?)
>>
>> Did you get any traces with bypass_guest_pf=0? That may show more info.
>>
>>
>
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
>
Me too. I would like to see all reasonable guests supported well,
without performance issues, and not have to tell the user which kernel to
use.
> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3-levels of page tables right?). I was hoping that would
> ring a bell with someone.
>
From the traces I saw, I think the standard kernel is PAE as well. Can
you verify? I think it's CONFIG_HIGHMEM4G (instead of
CONFIG_HIGHMEM64G) but that option may be different for such an old kernel.
> Adding bypass_guest_pf=0 did not improve the situation. Did you want anything
> particular with that setting -- like a RIP summary or a summary of exit-entry
> cycles?
>
I asked for this thinking bypass_guest_pf may help show more
information. But thinking a bit more, it will not.
I think I do know what the problem is. I will try it out. Is there a
free clone (like centos) available somewhere?
--
Any sufficiently difficult bug is indistinguishable from a feature.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 13:39 ` David S. Ahern
2008-04-30 13:49 ` Avi Kivity
@ 2008-04-30 13:56 ` Daniel P. Berrange
2008-04-30 14:23 ` David S. Ahern
1 sibling, 1 reply; 73+ messages in thread
From: Daniel P. Berrange @ 2008-04-30 13:56 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti, Avi Kivity
On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
> Avi Kivity wrote:
> > David S. Ahern wrote:
> >> Another tidbit for you guys as I make my way through various
> >> permutations:
> >> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
> >> better.
> >> System time still has some regular hiccups that are higher than xen
> >> and esx
> >> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
> >> but
> >> overall guest behavior is good with the hugemem kernel.
> >>
> >>
> >
> > Wait, the amount of info here is overwhelming. Let's stick with the
> > current kernel (32-bit, HIGHMEM4G, right?)
> >
> > Did you get any traces with bypass_guest_pf=0? That may show more info.
> >
>
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3-levels of page tables right?). I was hoping that would
> ring a bell with someone.
IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
give userspace and kernelspace their own independent pagetables
http://lwn.net/Articles/39925/
http://lwn.net/Articles/39283/
Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 13:56 ` Daniel P. Berrange
@ 2008-04-30 14:23 ` David S. Ahern
0 siblings, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 14:23 UTC (permalink / raw)
To: Daniel P. Berrange, Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti
Yes, the 4G/4G patch and the 64G options are both enabled for the hugemem kernel:
CONFIG_HIGHMEM64G=y
CONFIG_X86_4G=y
Differences between the "standard" kernel and the hugemem kernel:
# diff config-2.4.21-47.ELsmp config-2.4.21-47.ELhugemem
2157,2158c2157,2158
< CONFIG_M686=y
< # CONFIG_MPENTIUMIII is not set
---
> # CONFIG_M686 is not set
> CONFIG_MPENTIUMIII=y
2169c2169
< CONFIG_X86_PGE=y
---
> # CONFIG_X86_PGE is not set
2193c2193
< # CONFIG_X86_4G is not set
---
> CONFIG_X86_4G=y
2365,2366c2365
< CONFIG_M686=y
< CONFIG_X86_PGE=y
---
> CONFIG_MPENTIUMIII=y
2369,2372d2367
< # CONFIG_MXT is not set
< CONFIG_HOTPLUG_PCI=y
< CONFIG_HOTPLUG_PCI_COMPAQ=m
< CONFIG_HOTPLUG_PCI_IBM=m
2373a2369
> CONFIG_X86_4G=y
2377,2379d2372
< # CONFIG_EWRK3 is not set
< CONFIG_UNIX98_PTY_COUNT=2048
< CONFIG_HZ=512
2382a2376,2383
> # CONFIG_MXT is not set
> CONFIG_HOTPLUG_PCI=y
> CONFIG_HOTPLUG_PCI_COMPAQ=m
> CONFIG_HOTPLUG_PCI_IBM=m
> # CONFIG_EWRK3 is not set
> CONFIG_UNIX98_PTY_COUNT=2048
> CONFIG_DEBUG_BUGVERBOSE=y
> # CONFIG_PNPBIOS is not set
Avi:
Centos releases:
http://isoredirect.centos.org/centos/3/isos/i386/
I am running RHEL3.8 which I do not see listed. Also, I'll need to work on a
stock install and try to capture some kind of workload that exhibits the
problem. It will be a couple of days.
david
Daniel P. Berrange wrote:
> On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
>> Avi Kivity wrote:
>>> David S. Ahern wrote:
>>>> Another tidbit for you guys as I make my way through various
>>>> permutations:
>>>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>>>> better.
>>>> System time still has some regular hiccups that are higher than xen
>>>> and esx
>>>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>>>> but
>>>> overall guest behavior is good with the hugemem kernel.
>>>>
>>>>
>>> Wait, the amount of info here is overwhelming. Let's stick with the
>>> current kernel (32-bit, HIGHMEM4G, right?)
>>>
>>> Did you get any traces with bypass_guest_pf=0? That may show more info.
>>>
>> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
>> My point in the last email was that the hugemem kernel shows a remarkable
>> difference (it uses 3-levels of page tables right?). I was hoping that would
>> ring a bell with someone.
>
> IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
> give userspace and kernelspace their own independent pagetables
>
> http://lwn.net/Articles/39925/
> http://lwn.net/Articles/39283/
>
> Dan.
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-04-30 13:49 ` Avi Kivity
@ 2008-05-11 12:32 ` Avi Kivity
2008-05-11 13:36 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-11 12:32 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti
[-- Attachment #1: Type: text/plain, Size: 602 bytes --]
Avi Kivity wrote:
>
> I asked for this thinking bypass_guest_pf may help show more
> information. But thinking a bit more, it will not.
>
> I think I do know what the problem is. I will try it out. Is there a
> free clone (like centos) available somewhere?
This patch tracks down emulated accesses to speculated ptes and marks
them as accessed, preventing the flooding on centos-3.1. Unfortunately
it also causes a host oops midway through the boot process.
I believe the oops is merely exposed by the patch, not caused by it.
--
error compiling committee.c: too many arguments to function
[-- Attachment #2: prevent-kscand-flooding.patch --]
[-- Type: text/x-patch, Size: 2435 bytes --]
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..8c1e7f3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1127,8 +1127,10 @@ unshadowed:
 		else
 			kvm_release_pfn_clean(pfn);
 	}
-	if (!ptwrite || !*ptwrite)
+	if (speculative) {
 		vcpu->arch.last_pte_updated = shadow_pte;
+		vcpu->arch.last_pte_gfn = gfn;
+	}
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1674,6 +1676,17 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	vcpu->arch.update_pte.pfn = pfn;
 }
 
+static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u64 *spte = vcpu->arch.last_pte_updated;
+
+	if (spte
+	    && vcpu->arch.last_pte_gfn == gfn
+	    && shadow_accessed_mask
+	    && !(*spte & shadow_accessed_mask))
+		set_bit(PT_ACCESSED_SHIFT, spte);
+}
+
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes)
 {
@@ -1697,13 +1710,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 	mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
 	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_access_page(vcpu, gfn);
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	kvm_mmu_audit(vcpu, "pre pte write");
 	if (gfn == vcpu->arch.last_pt_write_gfn
 	    && !last_updated_pte_accessed(vcpu)) {
 		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
+		if (vcpu->arch.last_pt_write_count >= 4)
 			flooded = 1;
 	} else {
 		vcpu->arch.last_pt_write_gfn = gfn;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1730757..258e5d5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -15,7 +15,8 @@
 #define PT_USER_MASK (1ULL << 2)
 #define PT_PWT_MASK (1ULL << 3)
 #define PT_PCD_MASK (1ULL << 4)
-#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_ACCESSED_SHIFT 5
+#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
 #define PT_DIRTY_MASK (1ULL << 6)
 #define PT_PAGE_SIZE_MASK (1ULL << 7)
 #define PT_PAT_MASK (1ULL << 7)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1d8cd01..0bdb392 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -242,6 +242,7 @@ struct kvm_vcpu_arch {
 	gfn_t last_pt_write_gfn;
 	int last_pt_write_count;
 	u64 *last_pte_updated;
+	gfn_t last_pte_gfn;
 
 	struct {
 		gfn_t gfn; /* presumed gfn during guest pte update */
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-11 12:32 ` Avi Kivity
@ 2008-05-11 13:36 ` Avi Kivity
2008-05-13 3:49 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-11 13:36 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti
[-- Attachment #1: Type: text/plain, Size: 706 bytes --]
Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> I asked for this thinking bypass_guest_pf may help show more
>> information. But thinking a bit more, it will not.
>>
>> I think I do know what the problem is. I will try it out. Is there
>> a free clone (like centos) available somewhere?
>
> This patch tracks down emulated accesses to speculated ptes and marks
> them as accessed, preventing the flooding on centos-3.1.
> Unfortunately it also causes a host oops midway through the boot process.
>
> I believe the oops is merely exposed by the patch, not caused by it.
>
It was caused by the patch, please try the updated one attached.
--
error compiling committee.c: too many arguments to function
[-- Attachment #2: prevent-kscand-flooding.patch --]
[-- Type: text/x-patch, Size: 2473 bytes --]
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..012e8ad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1127,8 +1127,10 @@ unshadowed:
else
kvm_release_pfn_clean(pfn);
}
- if (!ptwrite || !*ptwrite)
+ if (speculative) {
vcpu->arch.last_pte_updated = shadow_pte;
+ vcpu->arch.last_pte_gfn = gfn;
+ }
}
static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1674,6 +1676,18 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
vcpu->arch.update_pte.pfn = pfn;
}
+static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ u64 *spte = vcpu->arch.last_pte_updated;
+
+ if (spte
+ && vcpu->arch.last_pte_gfn == gfn
+ && shadow_accessed_mask
+ && !(*spte & shadow_accessed_mask)
+ && is_shadow_present_pte(*spte))
+ set_bit(PT_ACCESSED_SHIFT, spte);
+}
+
void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
const u8 *new, int bytes)
{
@@ -1697,13 +1711,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
spin_lock(&vcpu->kvm->mmu_lock);
+ kvm_mmu_access_page(vcpu, gfn);
kvm_mmu_free_some_pages(vcpu);
++vcpu->kvm->stat.mmu_pte_write;
kvm_mmu_audit(vcpu, "pre pte write");
if (gfn == vcpu->arch.last_pt_write_gfn
&& !last_updated_pte_accessed(vcpu)) {
++vcpu->arch.last_pt_write_count;
- if (vcpu->arch.last_pt_write_count >= 3)
+ if (vcpu->arch.last_pt_write_count >= 5)
flooded = 1;
} else {
vcpu->arch.last_pt_write_gfn = gfn;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1730757..258e5d5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -15,7 +15,8 @@
#define PT_USER_MASK (1ULL << 2)
#define PT_PWT_MASK (1ULL << 3)
#define PT_PCD_MASK (1ULL << 4)
-#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_ACCESSED_SHIFT 5
+#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
#define PT_DIRTY_MASK (1ULL << 6)
#define PT_PAGE_SIZE_MASK (1ULL << 7)
#define PT_PAT_MASK (1ULL << 7)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1d8cd01..0bdb392 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -242,6 +242,7 @@ struct kvm_vcpu_arch {
gfn_t last_pt_write_gfn;
int last_pt_write_count;
u64 *last_pte_updated;
+ gfn_t last_pte_gfn;
struct {
gfn_t gfn; /* presumed gfn during guest pte update */
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-11 13:36 ` Avi Kivity
@ 2008-05-13 3:49 ` David S. Ahern
2008-05-13 7:25 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-13 3:49 UTC (permalink / raw)
To: Avi Kivity, kvm-devel
That does the trick with kscand.
Do you have recommendations for clock source settings? For example in my
test case for this patch the guest gained 73 seconds (ahead of real
time) after only 3 hours, 5 min of uptime.
thanks,
david
Avi Kivity wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> I asked for this thinking bypass_guest_pf may help show more
>>> information. But thinking a bit more, it will not.
>>>
>>> I think I do know what the problem is. I will try it out. Is there
>>> a free clone (like centos) available somewhere?
>>
>> This patch tracks down emulated accesses to speculated ptes and marks
>> them as accessed, preventing the flooding on centos-3.1.
>> Unfortunately it also causes a host oops midway through the boot process.
>>
>> I believe the oops is merely exposed by the patch, not caused by it.
>>
>
> It was caused by the patch, please try the updated one attached.
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-13 3:49 ` David S. Ahern
@ 2008-05-13 7:25 ` Avi Kivity
2008-05-14 20:35 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-13 7:25 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> That does the trick with kscand.
>
>
Not so fast... the patch updates the flood count to 5. Can you check
if a lower value still works? Also, whether updating the flood count to
5 (without the rest of the patch) works?
Unconditionally bumping the flood count to 5 will likely cause a
performance regression on other guests.
While I was able to see excessive flooding, I couldn't reproduce your
kscand problem. Running /bin/true always returned immediately for me.
> Do you have recommendations for clock source settings? For example in my
> test case for this patch the guest gained 73 seconds (ahead of real
> time) after only 3 hours, 5 min of uptime.
>
The kernel is trying to correlate the tsc and the pit, which isn't going to work.
Try disabling the tsc: set edx.bit4=0 for cpuid.eax=1 in qemu-kvm-x86.c's
do_cpuid_ent().
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-13 7:25 ` Avi Kivity
@ 2008-05-14 20:35 ` David S. Ahern
2008-05-15 10:53 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-14 20:35 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi Kivity wrote:
> Not so fast... the patch updates the flood count to 5. Can you check
> if a lower value still works? Also, whether updating the flood count to
> 5 (without the rest of the patch) works?
>
> Unconditionally bumping the flood count to 5 will likely cause a
> performance regression on other guests.
I put the flood count back to 3, and the RHEL3 guest performance is even
better.
>
> While I was able to see excessive flooding, I couldn't reproduce your
> kscand problem. Running /bin/true always returned immediately for me.
A poor attempt at finding a simplistic, minimal re-create. The use case
I am investigating has over 500 processes/threads with a base memory
consumption around 1GB. I was finding it nearly impossible to have a
generic re-create of the problem for you to use in your investigations
on CentOS.
Thanks for the patch.
david
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-14 20:35 ` David S. Ahern
@ 2008-05-15 10:53 ` Avi Kivity
2008-05-17 4:31 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-15 10:53 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel
David S. Ahern wrote:
> Avi Kivity wrote:
>
>> Not so fast... the patch updates the flood count to 5. Can you check
>> if a lower value still works? Also, whether updating the flood count to
>> 5 (without the rest of the patch) works?
>>
>> Unconditionally bumping the flood count to 5 will likely cause a
>> performance regression on other guests.
>>
>
> I put the flood count back to 3, and the RHEL3 guest performance is even
> better.
>
>
Okay, I committed the patch without the flood count == 5.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-15 10:53 ` Avi Kivity
@ 2008-05-17 4:31 ` David S. Ahern
[not found] ` <482FCEE1.5040306@qumranet.com>
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-17 4:31 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
[-- Attachment #1: Type: text/plain, Size: 1092 bytes --]
Avi Kivity wrote:
>
> Okay, I committed the patch without the flood count == 5.
>
I've continued testing the RHEL3 guests with the flood count at 3, and I
am right back to where I started. With the patch and the flood count at
3, I had 2 runs totaling around 24 hours that looked really good. Now, I
am back to square one. I guess the short of it is that I am not sure if
the patch resolves this issue or not.
If you want to back it out, I can continue to apply on my end as I
continue testing. A snapshot of kvm_stat -f 'mmu*' -l is attached for
two test runs with the patch (line wrap is horrible inline).
I will work on creating an app that will stimulate kscand activity
similar to what I am seeing.
Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
noticed that with the -no-kvm-pit option the guest time is much better
and typically stays within 3 seconds or so of the host, even through the
high kscand activity which is one instance of when I've noticed time
jumps with the kernel pit. Yes, this result has been repeatable through
6 or so runs. :-)
david
[-- Attachment #2: kvm-stats-rhel3 --]
[-- Type: text/plain, Size: 4102 bytes --]
kvm-68 with Avi's patch and flood threshold at 3:
mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc mmu_shado
175 880 880 0 1832 2714 0 880
35 868 868 0 1782 2650 0 868
91 8522 8520 131 29179 38651 0 8722
28 991 992 0 2314 3312 0 992
91 796 796 0 1648 2445 0 796
81 1944 1943 0 7241 9213 0 1943
98 4149 4148 31 11975 16196 0 4214
41 3379 3380 0 9710 13100 0 3378
42 17729 17730 0 48415 66152 0 17729
guest has an apparent lockup at this point; when it unfreezes, kscand cpu
time jumps by roughly the length of time the command line was frozen (on
the order of 30 seconds or more)
14 18634 18633 0 48286 66921 0 18634
21 18607 18607 0 48395 67001 0 18607
91 17991 17991 0 50039 68040 0 17991
7 17919 17920 0 53731 71650 0 17919
7 18060 18060 0 53539 71599 0 18060
21 17755 17755 0 52714 70469 0 17755
-----------------------
with Avi's patch and flood threshold at 5.
mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc mmu_shado
147 604 602 42 21299 21957 0 660
112 163 167 23 7567 7759 0 170
105 0 1 2 3378 3381 0 1
14 4 4 0 9685 9689 0 4
137 628 623 43 21557 22255 0 682
42 0 2 4 5834 5840 0 2
91 14 16 0 25741 25757 0 16
28 58 55 0 23571 23626 0 55
84 627 624 45 32588 33268 0 685
132 9 13 1 12162 12177 0 13
91 0 1 0 3422 3423 0 1
35 1 1 0 4624 4625 0 1
102 237 244 0 12257 12504 0 242
19 401 387 46 20643 21088 0 449
26 3 4 1 127252 127261 0 4
guest has an apparent lockup at this point; when it unfreezes, kscand cpu
time jumps by roughly the length of time the command line was frozen (on
the order of 30 seconds or more)
21 0 0 0 182651 182651 0 0
14 0 0 0 182524 182523 0 0
178 4 5 4 170752 170759 0 5
35 0 0 0 181471 181473 0 0
21 0 0 0 182263 182263 0 0
14 0 0 0 182493 182494 0 0
21 0 0 0 182489 182488 0 0
91 0 0 0 182203 182204 0 0
35 0 0 0 182378 182377 0 0
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
[not found] ` <4830F90A.1020809@cisco.com>
@ 2008-05-19 4:14 ` David S. Ahern
2008-05-19 14:27 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-19 4:14 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
[resend to new list].
David S. Ahern wrote:
> I was just digging through the sysstat history files, and I was not
> imagining it: I did have an excellent overnight run on 5/13-5/14 with
> your patch and the standard RHEL3U8 smp kernel in the guest. I have no
> idea why I cannot get anywhere close to that again. I have updated quite
> a few variables since then (such as going from 2.6.25-rc8 to 2.6.25.3
> kernel in the host), but backing them out (i.e., resetting the test to
> my recollection of all the details of 5/14) has not helped. baffling and
> frustrating.
>
> more in-line below.
>
>
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> Avi Kivity wrote:
>>>
>>>> Okay, I committed the patch without the flood count == 5.
>>>>
>>>>
>>> I've continued testing the RHEL3 guests with the flood count at 3, and I
>>> am right back to where I started. With the patch and the flood count at
>>> 3, I had 2 runs totaling around 24 hours that looked really good. Now, I
>>> am back to square one. I guess the short of it is that I am not sure if
>>> the patch resolves this issue or not.
>>>
>>>
>> What about with the flood count at 5? Does it reliably improve
>> performance?
>>
>
> [dsa] No. I saw the same problem with the flood count at 5. The
> attachment in the last email shows kvm_stat data during a kscand event.
> The data was collected with the patch you posted. With the flood count
> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
> mmu_cache/flood drops to 0 and pte updates and writes both hit
> 180,000+/second. In both cases these last for 30 seconds or more. I only
> included data for the onset as it's pretty flat during the kscand activity.
>
>>> Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
>>> noticed that with the -no-kvm-pit option the guest time is much better
>>> and typically stays within 3 seconds or so of the host, even through the
>>> high kscand activity which is one instance of when I've noticed time
>>> jumps with the kernel pit. Yes, this result has been repeatable through
>>> 6 or so runs. :-)
>>>
>> Strange. The in-kernel PIT was supposed to improve accuracy.
>>
>
> [dsa] I started a run with the RHEL4 guest 8 hours ago and it is showing
> the same kind of success. With the in-kernel PIT, time in the guest
> advanced ~120 seconds over real time after just 2 days of up time. With
> the userspace PIT, time in the guest is behind real time by only 1
> second after 8 hours of uptime. Note that I am running the RHEL4.6
> kernel recompiled with HZ at 250 instead of the usual 1000.
>
> david
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-19 4:14 ` [kvm-devel] " David S. Ahern
@ 2008-05-19 14:27 ` Avi Kivity
2008-05-19 16:25 ` David S. Ahern
2008-05-20 14:19 ` Avi Kivity
0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-19 14:27 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
>> [dsa] No. I saw the same problem with the flood count at 5. The
>> attachment in the last email shows kvm_stat data during a kscand event.
>> The data was collected with the patch you posted. With the flood count
>> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
>> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
>> mmu_cache/flood drops to 0 and pte updates and writes both hit
>> 180,000+/second. In both cases these last for 30 seconds or more. I only
>> included data for the onset as it's pretty flat during the kscand activity.
>>
It makes sense. We removed a flooding false positive, and introduced a
false negative.
The guest access sequence is:
- point kmap pte at page table
- use the new pte to access the page table
Prior to the patch, the mmu didn't see the 'use' part, so it concluded
the kmap pte would be better off unshadowed. This shows up as a high
flood count.
After the patch, this no longer happens, so the sequence can repeat for
long periods. However the pte that is the result of the 'use' part is
never accessed, so it should be detected as flooded! But our flood
detection mechanism looks at one page at a time (per vcpu), while there
are two pages involved here.
There are (at least) three options available:
- detect and special-case this scenario
- change the flood detector to be per page table instead of per vcpu
- change the flood detector to look at a list of recently used page
tables instead of the last page table
I'm having a hard time trying to pick between the second and third options.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-19 14:27 ` Avi Kivity
@ 2008-05-19 16:25 ` David S. Ahern
2008-05-19 17:04 ` Avi Kivity
2008-05-20 14:19 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-19 16:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Does the fact that the hugemem kernel works just fine have any bearing
on your options? Or rather, is there something unique about the way
kscand works in the hugemem kernel that its performance is ok?
I mentioned last month (so without your first patch) that running the
hugemem kernel showed a remarkable improvement in performance compared
to the standard smp kernel. Over the weekend I ran a test with your
first patch and with the flood detector at 3 (I have not run a case with
the detector at 5) and performance with the hugemem was even better in
the sense that 1-minute averages of guest system time show no noticeable
spikes.
In an earlier post I showed a diff in the config files for the standard
SMP and hugemem kernels. See:
http://article.gmane.org/gmane.comp.emulators.kvm.devel/16944/
david
Avi Kivity wrote:
> David S. Ahern wrote:
>>> [dsa] No. I saw the same problem with the flood count at 5. The
>>> attachment in the last email shows kvm_stat data during a kscand event.
>>> The data was collected with the patch you posted. With the flood count
>>> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates
>>> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5
>>> mmu_cache/flood drops to 0 and pte updates and writes both hit
>>> 180,000+/second. In both cases these last for 30 seconds or more. I only
>>> included data for the onset as it's pretty flat during the kscand
>>> activity.
>>>
>
> It makes sense. We removed a flooding false positive, and introduced a
> false negative.
>
> The guest access sequence is:
> - point kmap pte at page table
> - use the new pte to access the page table
>
> Prior to the patch, the mmu didn't see the 'use' part, so it concluded
> the kmap pte would be better off unshadowed. This shows up as a high
> flood count.
>
> After the patch, this no longer happens, so the sequence can repeat for
> long periods. However the pte that is the result of the 'use' part is
> never accessed, so it should be detected as flooded! But our flood
> detection mechanism looks at one page at a time (per vcpu), while there
> are two pages involved here.
>
> There are (at least) three options available:
> - detect and special-case this scenario
> - change the flood detector to be per page table instead of per vcpu
> - change the flood detector to look at a list of recently used page
> tables instead of the last page table
>
> I'm having a hard time trying to pick between the second and third options.
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-19 16:25 ` David S. Ahern
@ 2008-05-19 17:04 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-19 17:04 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
> Does the fact that the hugemem kernel works just fine have any bearing
> on your options? Or rather, is there something unique about the way
> kscand works in the hugemem kernel that its performance is ok?
>
>
Yes. If your guest has < 4GB of memory, then all of it is lowmem in the
hugemem kernel, and the two-step process for modifying a pte is
short-circuited into just one step, and everything works fine.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-19 14:27 ` Avi Kivity
2008-05-19 16:25 ` David S. Ahern
@ 2008-05-20 14:19 ` Avi Kivity
2008-05-20 14:34 ` Avi Kivity
2008-05-22 22:08 ` David S. Ahern
1 sibling, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-20 14:19 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
[-- Attachment #1: Type: text/plain, Size: 798 bytes --]
Avi Kivity wrote:
>
> There are (at least) three options available:
> - detect and special-case this scenario
> - change the flood detector to be per page table instead of per vcpu
> - change the flood detector to look at a list of recently used page
> tables instead of the last page table
>
> I'm having a hard time trying to pick between the second and third
> options.
>
The answer turns out to be "yes", so here's a patch that adds a pte
access history table for each shadowed guest page-table. Let me know if
it helps. Benchmarking a variety of workloads on all guests supported
by kvm is left as an exercise for the reader, but I suspect the patch
will either improve things all around, or can be modified to do so.
--
error compiling committee.c: too many arguments to function
[-- Attachment #2: per-page-pte-history.patch --]
[-- Type: text/x-patch, Size: 4637 bytes --]
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 154727d..1a3d01a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1130,7 +1130,8 @@ unshadowed:
if (speculative) {
vcpu->arch.last_pte_updated = shadow_pte;
vcpu->arch.last_pte_gfn = gfn;
- }
+ } else
+ page_header(__pa(shadow_pte))->pte_history_len = 0;
}
static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1616,13 +1617,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
kvm_mmu_flush_tlb(vcpu);
}
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
- u64 *spte = vcpu->arch.last_pte_updated;
-
- return !!(spte && (*spte & shadow_accessed_mask));
-}
-
static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
const u8 *new, int bytes)
{
@@ -1679,13 +1673,49 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
{
u64 *spte = vcpu->arch.last_pte_updated;
+ struct kvm_mmu_page *page;
+
+ if (spte && vcpu->arch.last_pte_gfn == gfn) {
+ page = page_header(__pa(spte));
+ page->pte_history_len = 0;
+ pgprintk("clearing page history, gfn %x ent %lx\n",
+ page->gfn, spte - page->spt);
+ }
+}
+
+static bool kvm_mmu_page_flooded(struct kvm_mmu_page *page)
+{
+ int i, j, ent, len;
- if (spte
- && vcpu->arch.last_pte_gfn == gfn
- && shadow_accessed_mask
- && !(*spte & shadow_accessed_mask)
- && is_shadow_present_pte(*spte))
- set_bit(PT_ACCESSED_SHIFT, spte);
+ len = page->pte_history_len;
+ for (i = len; i != 0; --i) {
+ ent = page->pte_history[i - 1];
+ if (test_bit(PT_ACCESSED_SHIFT, &page->spt[ent])) {
+ for (j = i; j < len; ++j)
+ page->pte_history[j-i] = page->pte_history[j];
+ page->pte_history_len = len - i;
+ return false;
+ }
+ }
+ if (page->pte_history_len < KVM_MAX_PTE_HISTORY)
+ return false;
+ return true;
+}
+
+static void kvm_mmu_log_pte_history(struct kvm_mmu_page *page, u64 *spte)
+{
+ int i;
+ unsigned ent = spte - page->spt;
+
+ if (page->pte_history_len > 0
+ && page->pte_history[page->pte_history_len - 1] == ent)
+ return;
+ if (page->pte_history_len == KVM_MAX_PTE_HISTORY) {
+ for (i = 1; i < KVM_MAX_PTE_HISTORY; ++i)
+ page->pte_history[i-1] = page->pte_history[i];
+ --page->pte_history_len;
+ }
+ page->pte_history[page->pte_history_len++] = ent;
}
void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1704,7 +1734,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned misaligned;
unsigned quadrant;
int level;
- int flooded = 0;
int npte;
int r;
@@ -1715,16 +1744,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
kvm_mmu_free_some_pages(vcpu);
++vcpu->kvm->stat.mmu_pte_write;
kvm_mmu_audit(vcpu, "pre pte write");
- if (gfn == vcpu->arch.last_pt_write_gfn
- && !last_updated_pte_accessed(vcpu)) {
- ++vcpu->arch.last_pt_write_count;
- if (vcpu->arch.last_pt_write_count >= 3)
- flooded = 1;
- } else {
- vcpu->arch.last_pt_write_gfn = gfn;
- vcpu->arch.last_pt_write_count = 1;
- vcpu->arch.last_pte_updated = NULL;
- }
index = kvm_page_table_hashfn(gfn);
bucket = &vcpu->kvm->arch.mmu_page_hash[index];
hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
@@ -1733,7 +1752,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
pte_size = sp->role.glevels == PT32_ROOT_LEVEL ? 4 : 8;
misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
misaligned |= bytes < 4;
- if (misaligned || flooded) {
+ if (misaligned || kvm_mmu_page_flooded(sp)) {
/*
* Misaligned accesses are too much trouble to fix
* up; also, they usually indicate a page is not used
@@ -1785,6 +1804,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
mmu_pte_write_zap_pte(vcpu, sp, spte);
if (new)
mmu_pte_write_new_pte(vcpu, sp, spte, new);
+ kvm_mmu_log_pte_history(sp, spte);
mmu_pte_write_flush_tlb(vcpu, entry, *spte);
++spte;
}
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index a71f3aa..cbe550e 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -78,6 +78,7 @@
#define KVM_MIN_FREE_MMU_PAGES 5
#define KVM_REFILL_PAGES 25
#define KVM_MAX_CPUID_ENTRIES 40
+#define KVM_MAX_PTE_HISTORY 4
extern spinlock_t kvm_lock;
extern struct list_head vm_list;
@@ -189,6 +190,9 @@ struct kvm_mmu_page {
u64 *parent_pte; /* !multimapped */
struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
};
+
+ u16 pte_history_len;
+ u16 pte_history[KVM_MAX_PTE_HISTORY];
};
/*
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-20 14:19 ` Avi Kivity
@ 2008-05-20 14:34 ` Avi Kivity
2008-05-22 22:08 ` David S. Ahern
1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-20 14:34 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
Avi Kivity wrote:
>
> The answer turns out to be "yes", so here's a patch that adds a pte
> access history table for each shadowed guest page-table. Let me know
> if it helps. Benchmarking a variety of workloads on all guests
> supported by kvm is left as an exercise for the reader, but I suspect
> the patch will either improve things all around, or can be modified to
> do so.
>
btw, the patch applied on top of kvm HEAD (which includes the previous
patch).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-20 14:19 ` Avi Kivity
2008-05-20 14:34 ` Avi Kivity
@ 2008-05-22 22:08 ` David S. Ahern
2008-05-28 10:51 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-22 22:08 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]
The short answer is that I am still seeing large system time hiccups in the
guests due to kscand in the guest scanning its active lists. I do see
better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
completeness I also tried a history of 2, but it performed worse than 3,
which is no surprise given what the parameter means.)
I have been able to scratch out a simplistic program that stimulates
kscand activity similar to what is going on in my real guest (see
attached). The program requests a memory allocation, initializes it (to
get it backed) and then in a loop sweeps through the memory in chunks
similar to a program using parts of its memory here and there but
eventually accessing all of it.
Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
using a fair amount of highmem. Start a couple of instances of the
attached. For example, I've been using these 2:
memuser 768M 120 5 300
memuser 384M 300 10 600
Together these instances take up a 1GB of RAM and once initialized
consume very little CPU. On kvm they make kscand and kswapd go nuts
every 5-15 minutes. For comparison, I do not see the same behavior for
an identical setup running on esx 3.5.
david
Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> There are (at least) three options available:
>> - detect and special-case this scenario
>> - change the flood detector to be per page table instead of per vcpu
>> - change the flood detector to look at a list of recently used page
>> tables instead of the last page table
>>
>> I'm having a hard time trying to pick between the second and third
>> options.
>>
>
> The answer turns out to be "yes", so here's a patch that adds a pte
> access history table for each shadowed guest page-table. Let me know if
> it helps. Benchmarking a variety of workloads on all guests supported
> by kvm is left as an exercise for the reader, but I suspect the patch
> will either improve things all around, or can be modified to do so.
>
[-- Attachment #2: memuser.c --]
[-- Type: text/x-csrc, Size: 2621 bytes --]
/* simple program to malloc memory, initialize it, and
 * then repetitively use it to keep it active.
 */
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libgen.h>

/* goal is to sweep memory every T1 sec by accessing a
 * percentage at a time and sleeping T2 sec in between accesses.
 * Once all the memory has been accessed, sleep for T3 sec
 * before starting the cycle over.
 */
#define T1 180
#define T2 5
#define T3 300

const char *timestamp(void);

void usage(const char *prog) {
	fprintf(stderr, "\nusage: %s memlen{M|K} [t1 t2 t3]\n", prog);
}

int main(int argc, char *argv[])
{
	int len;
	char *endp;
	int factor, endp_len;
	int start, incr;
	int t1 = T1, t2 = T2, t3 = T3;
	char *mem;
	char c = 0;

	if (argc < 2) {
		usage(basename(argv[0]));
		return 1;
	}

	/*
	 * determine memory to request
	 */
	len = (int) strtol(argv[1], &endp, 0);
	factor = 1;
	endp_len = strlen(endp);
	if ((endp_len == 1) && ((*endp == 'M') || (*endp == 'm')))
		factor = 1024 * 1024;
	else if ((endp_len == 1) && ((*endp == 'K') || (*endp == 'k')))
		factor = 1024;
	else if (endp_len) {
		fprintf(stderr, "invalid memory len.\n");
		return 1;
	}
	len *= factor;
	if (len == 0) {
		fprintf(stdout, "memory len is 0.\n");
		return 1;
	}

	/*
	 * convert times if given
	 */
	if (argc > 2) {
		if (argc < 5) {
			usage(basename(argv[0]));
			return 1;
		}
		t1 = atoi(argv[2]);
		t2 = atoi(argv[3]);
		t3 = atoi(argv[4]);
	}

	/*
	 * amount of memory to sweep at one time
	 */
	if (t1 && t2)
		incr = len / t1 * t2;
	else
		incr = len;

	mem = (char *) malloc(len);
	if (mem == NULL) {
		fprintf(stderr, "malloc failed\n");
		return 1;
	}
	printf("memory allocated. initializing to 0\n");
	memset(mem, 0, len);

	start = 0;
	printf("%s starting memory update.\n", timestamp());
	while (1) {
		c++;
		if (c == 0x7f) c = 0;
		memset(mem + start, c, incr);
		start += incr;
		if ((start >= len) || ((start + incr) >= len)) {
			printf("%s scan complete. sleeping %d\n",
			       timestamp(), t3);
			start = 0;
			sleep(t3);
			printf("%s starting memory update.\n", timestamp());
		} else if (t2)
			sleep(t2);
	}
	return 0;
}

const char *timestamp(void)
{
	static char date[64];
	struct timeval now;
	struct tm ltime;

	memset(date, 0, sizeof(date));
	if (gettimeofday(&now, NULL) == 0) {
		if (localtime_r(&now.tv_sec, &ltime))
			strftime(date, sizeof(date), "%m/%d %H:%M:%S", &ltime);
	}

	if (strlen(date) == 0)
		strcpy(date, "unknown");

	return date;
}
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-22 22:08 ` David S. Ahern
@ 2008-05-28 10:51 ` Avi Kivity
2008-05-28 14:13 ` David S. Ahern
2008-05-29 16:42 ` David S. Ahern
0 siblings, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 10:51 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
> The short answer is that I am still seeing large system time hiccups in the
> guests due to kscand in the guest scanning its active lists. I do see
> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
> completeness I also tried a history of 2, but it performed worse than 3
> which is no surprise given the meaning of it.)
>
>
> I have been able to scratch out a simplistic program that stimulates
> kscand activity similar to what is going on in my real guest (see
> attached). The program requests a memory allocation, initializes it (to
> get it backed) and then in a loop sweeps through the memory in chunks
> similar to a program using parts of its memory here and there but
> eventually accessing all of it.
>
> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
> using a fair amount of highmem. Start a couple of instances of the
> attached. For example, I've been using these 2:
>
> memuser 768M 120 5 300
> memuser 384M 300 10 600
>
> Together these instances take up a 1GB of RAM and once initialized
> consume very little CPU. On kvm they make kscand and kswapd go nuts
> every 5-15 minutes. For comparison, I do not see the same behavior for
> an identical setup running on esx 3.5.
>
I haven't been able to reproduce this:
> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ?
> 00:00:26 [kscand]
> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0
> 00:00:21 ./memuser 768M 120 5 300
> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0
> 00:00:10 ./memuser 384M 300 10 600
> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0
> 00:00:00 grep -E memuser|kscand
The workload has been running for about half an hour, and kswapd cpu
usage doesn't seem significant. This is a 2GB guest running with my
patch ported to kvm.git HEAD.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 10:51 ` Avi Kivity
@ 2008-05-28 14:13 ` David S. Ahern
2008-05-28 14:35 ` Avi Kivity
2008-05-28 14:48 ` Andrea Arcangeli
2008-05-29 16:42 ` David S. Ahern
1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 14:13 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Weird. Could it be something about the hosts?
I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
I'll rebuild kvm-69 with your latest patch and try the test programs again.
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still seeing large system time hiccups in the
>> guests due to kscand in the guest scanning its active lists. I do see
>> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
>> completeness I also tried a history of 2, but it performed worse than 3
>> which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>> memuser 768M 120 5 300
>> memuser 384M 300 10 600
>>
>> Together these instances take up a 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>
>
> I haven't been able to reproduce this:
>
>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ?
>> 00:00:26 [kscand]
>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0
>> 00:00:00 grep -E memuser|kscand
>
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant. This is a 2GB guest running with my
> patch ported to kvm.git HEAD. Guest has 2GB of memory.
>
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:13 ` David S. Ahern
@ 2008-05-28 14:35 ` Avi Kivity
2008-05-28 19:49 ` David S. Ahern
2008-05-28 14:48 ` Andrea Arcangeli
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 14:35 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
> Weird. Could it be something about the hosts?
>
> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
>
> I'll rebuild kvm-69 with your latest patch and try the test programs again.
>
I've pushed it into kvm.git, branch name per-page-pte-tracking.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:13 ` David S. Ahern
2008-05-28 14:35 ` Avi Kivity
@ 2008-05-28 14:48 ` Andrea Arcangeli
2008-05-28 14:57 ` Avi Kivity
2008-05-28 15:37 ` Avi Kivity
1 sibling, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 14:48 UTC (permalink / raw)
To: David S. Ahern; +Cc: Avi Kivity, kvm
On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
> Weird. Could it be something about the hosts?
Note that the VM itself will never make use of kmap. The VM is "data"
agnostic: it never has any idea of the data contained in the
pages. kmap/kmap_atomic/kunmap_atomic are only needed to access _data_.
Only I/O (if not using DMA, or because of bounce buffers) and page
faults triggered in user process context, or other operations again
done from user process context, will call into kmap or kmap_atomic.
And if KVM is inefficient in handling kmap/kmap_atomic, that will lead
to the user process running slower, and in turn generating less
pressure on the guest and host VM if anything. The guest will run slower
than it should if KVM isn't optimized for the workload, but it
shouldn't alter any VM kernel thread CPU usage; only the CPU usage of
the guest process context and host system time in the qemu task should go
up, nothing else. This is again because the VM will never care about
the data contents and will never invoke kmap/kmap_atomic.
So I never found a relation between the reported symptom of VM kernel
threads going weird and KVM's handling of kmap ptes.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:48 ` Andrea Arcangeli
@ 2008-05-28 14:57 ` Avi Kivity
2008-05-28 15:39 ` David S. Ahern
2008-05-28 15:58 ` Andrea Arcangeli
2008-05-28 15:37 ` Avi Kivity
1 sibling, 2 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 14:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David S. Ahern, kvm
Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
>
>> Weird. Could it be something about the hosts?
>>
>
> Note that the VM itself will never make use of kmap. The VM is "data"
> agnostic. The VM has never any idea with the data contained by the
> pages. kmap/kmap_atomic/kunmap_atomic are only need to access _data_.
>
>
What about CONFIG_HIGHPTE?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:48 ` Andrea Arcangeli
2008-05-28 14:57 ` Avi Kivity
@ 2008-05-28 15:37 ` Avi Kivity
2008-05-28 15:43 ` David S. Ahern
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-28 15:37 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David S. Ahern, kvm
Andrea Arcangeli wrote:
>
> So I never found a relation to the symptom reported of VM kernel
> threads going weird, with KVM optimal handling of kmap ptes.
>
The problem is this code:
static int scan_active_list(struct zone_struct * zone, int age,
			    struct list_head * list)
{
	struct list_head *page_lru, *next;
	struct page * page;
	int over_rsslimit;

	/* Take the lock while messing with the list... */
	lru_lock(zone);
	list_for_each_safe(page_lru, next, list) {
		page = list_entry(page_lru, struct page, lru);
		pte_chain_lock(page);
		if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
			age_page_up_nolock(page, age);
		pte_chain_unlock(page);
	}
	lru_unlock(zone);
	return 0;
}
If the pages in the list are in the same order as in the ptes (which is
very likely), then we have the following access pattern
- set up kmap to point at pte
- test_and_clear_bit(pte)
- kunmap
From kvm's point of view this looks like
- several accesses to set up the kmap
- if these accesses trigger flooding, we will have to tear down the
shadow for this page, only to set it up again soon
- an access to the pte (emulated)
- if this access _doesn't_ trigger flooding, we will have 512 unneeded
emulations. The pte is worthless anyway since the accessed bit is clear
(so we can't set up a shadow pte for it)
- this bug was fixed
- an access to tear down the kmap
[btw, am I reading this right? the entire list is scanned each time?
if you have 1G of active HIGHMEM, that's a quarter of a million pages,
which would take at least a second no matter what we do. VMware can
probably special-case kmaps, but we can't]
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:57 ` Avi Kivity
@ 2008-05-28 15:39 ` David S. Ahern
2008-05-29 11:49 ` Avi Kivity
2008-05-29 12:10 ` Avi Kivity
2008-05-28 15:58 ` Andrea Arcangeli
1 sibling, 2 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 15:39 UTC (permalink / raw)
To: Avi Kivity, Andrea Arcangeli; +Cc: kvm
I've been instrumenting the guest kernel as well. It's the scanning of
the active lists that triggers a lot of calls to paging64_prefetch_page,
and, as you guys know, correlates with the number of direct pages in the
list. Earlier in this thread I traced the kvm cycles to
paging64_prefetch_page(). See
http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html
In the guest I started capturing scans (kscand() loop) that took longer
than a jiffie. Here's an example for 1 trip through the active lists,
both anonymous and cache:
active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
36234, dj 225
active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3
active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
84829, dj 848
active_cache_scan: HighMem, age 12, count[age] 3397 -> 2640, direct 889,
dj 19
active_cache_scan: HighMem, age 8, count[age] 6105 -> 5884, direct 988,
dj 24
active_cache_scan: HighMem, age 4, count[age] 18923 -> 18400, direct
11141, dj 37
active_cache_scan: HighMem, age 0, count[age] 14283 -> 14283, direct 69,
dj 1
An explanation of the line (using the first one): it's a scan of the
anonymous list, age bucket of 4. Before the scan loop the bucket had
41863 pages and after the loop the bucket had 30194. Of the pages in the
bucket 36234 were direct pages(ie., PageDirect(page) was non-zero) and
for this bucket 225 jiffies passed while running scan_active_list().
On the host side the total times (sum of the dj's/100) in the output
above directly match with kvm_stat output, spikes in pte_writes/updates.
Tracing the RHEL3 code I believe linux-2.4.21-rmap.patch is the patch
that brought in the code that is run during the active list scans for
direct pages. In and of itself each trip through the while loop in
scan_active_list does not take a lot of time, but when run say 84,829
times (see age 0 above) the cumulative time is high, 8.48 seconds per
the example above.
I'll pull down the git branch and give it a spin.
david
Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
>>
>>> Weird. Could it be something about the hosts?
>>>
>>
>> Note that the VM itself will never make use of kmap. The VM is "data"
>> agnostic. The VM has never any idea with the data contained by the
>> pages. kmap/kmap_atomic/kunmap_atomic are only need to access _data_.
>>
>>
>
> What about CONFIG_HIGHPTE?
>
>
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 15:37 ` Avi Kivity
@ 2008-05-28 15:43 ` David S. Ahern
2008-05-28 17:04 ` Andrea Arcangeli
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 15:43 UTC (permalink / raw)
To: Avi Kivity; +Cc: Andrea Arcangeli, kvm
This is the code in the RHEL3.8 kernel:
static int scan_active_list(struct zone_struct * zone, int age,
			    struct list_head * list, int count)
{
	struct list_head *page_lru, *next;
	struct page * page;
	int over_rsslimit;

	count = count * kscand_work_percent / 100;
	/* Take the lock while messing with the list... */
	lru_lock(zone);
	while (count-- > 0 && !list_empty(list)) {
		page = list_entry(list->prev, struct page, lru);
		pte_chain_lock(page);
		if (page_referenced(page, &over_rsslimit)
		    && !over_rsslimit
		    && check_mapping_inuse(page))
			age_page_up_nolock(page, age);
		else {
			list_del(&page->lru);
			list_add(&page->lru, list);
		}
		pte_chain_unlock(page);
	}
	lru_unlock(zone);
	return 0;
}
My previous email shows examples of the number of pages in the list and
the scanning that happens.
david
Avi Kivity wrote:
> Andrea Arcangeli wrote:
>>
>> So I never found a relation to the symptom reported of VM kernel
>> threads going weird, with KVM optimal handling of kmap ptes.
>>
>
>
> The problem is this code:
>
> static int scan_active_list(struct zone_struct * zone, int age,
> struct list_head * list)
> {
> struct list_head *page_lru , *next;
> struct page * page;
> int over_rsslimit;
>
> /* Take the lock while messing with the list... */
> lru_lock(zone);
> list_for_each_safe(page_lru, next, list) {
> page = list_entry(page_lru, struct page, lru);
> pte_chain_lock(page);
> if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
> age_page_up_nolock(page, age);
> pte_chain_unlock(page);
> }
> lru_unlock(zone);
> return 0;
> }
>
> If the pages in the list are in the same order as in the ptes (which is
> very likely), then we have the following access pattern
>
> - set up kmap to point at pte
> - test_and_clear_bit(pte)
> - kunmap
>
> From kvm's point of view this looks like
>
> - several accesses to set up the kmap
> - if these accesses trigger flooding, we will have to tear down the
> shadow for this page, only to set it up again soon
> - an access to the pte (emulated)
> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> emulations. The pte is worthless anyway since the accessed bit is clear
> (so we can't set up a shadow pte for it)
> - this bug was fixed
> - an access to tear down the kmap
>
> [btw, am I reading this right? the entire list is scanned each time?
>
> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> which would take at least a second no matter what we do. VMware can
> probably special-case kmaps, but we can't]
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:57 ` Avi Kivity
2008-05-28 15:39 ` David S. Ahern
@ 2008-05-28 15:58 ` Andrea Arcangeli
1 sibling, 0 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 15:58 UTC (permalink / raw)
To: Avi Kivity; +Cc: David S. Ahern, kvm
On Wed, May 28, 2008 at 05:57:21PM +0300, Avi Kivity wrote:
> What about CONFIG_HIGHPTE?
Ah yes sorry! Official 2.4 has no highpte capability but surely RH
backported highpte to 2.4 so that would explain the cpu time spent in
kswapd _guest_ context.
If highpte is the problem and you're having trouble reproducing, I recommend
running a dozen of those in the background, immediately after boot, on a
2.4 VM that has ZERO_PAGE support. This will ensure there are
tons of pagetables in high memory. This should allocate purely
pagetables and allow for a worst case of highpte. Check with
/proc/meminfo that the pagetable number goes up by a few megabytes for
each one of those tasks. Then just try to allocate some real ram (not
zeropage), and if there's a problem with highptes it should be possible
to reproduce it with so many highptes allocated in the system. Guest
VM size should be 2G; you don't really need more than 2G to reproduce
using the ZERO_PAGE trick below.
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

int main()
{
	char *p1, *p2;

	p1 = malloc(512*1024*1024);
	p2 = malloc(512*1024*1024);
	if (memcmp(p1, p2, 512*1024*1024))
		*(char *)0 = 0;
	pause();
	return 0;
}
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 15:43 ` David S. Ahern
@ 2008-05-28 17:04 ` Andrea Arcangeli
2008-05-28 17:24 ` David S. Ahern
2008-05-29 10:01 ` Avi Kivity
0 siblings, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-28 17:04 UTC (permalink / raw)
To: David S. Ahern; +Cc: Avi Kivity, kvm
On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
> This is the code in the RHEL3.8 kernel:
>
> static int scan_active_list(struct zone_struct * zone, int age,
> struct list_head * list, int count)
> {
> struct list_head *page_lru , *next;
> struct page * page;
> int over_rsslimit;
>
> count = count * kscand_work_percent / 100;
> /* Take the lock while messing with the list... */
> lru_lock(zone);
> while (count-- > 0 && !list_empty(list)) {
> page = list_entry(list->prev, struct page, lru);
> pte_chain_lock(page);
> if (page_referenced(page, &over_rsslimit)
> && !over_rsslimit
> && check_mapping_inuse(page))
> age_page_up_nolock(page, age);
> else {
> list_del(&page->lru);
> list_add(&page->lru, list);
> }
> pte_chain_unlock(page);
> }
> lru_unlock(zone);
> return 0;
> }
>
> My previous email shows examples of the number of pages in the list and
> the scanning that happens.
This code looks better than the one below, as a limit was introduced
and the whole list isn't scanned anymore. If you decrease
kscand_work_percent (I assume it's a sysctl even if it's missing the
sysctl_ prefix) to, say, 1, you can limit the damage. Did you try it?
> Avi Kivity wrote:
> > Andrea Arcangeli wrote:
> >>
> >> So I never found a relation to the symptom reported of VM kernel
> >> threads going weird, with KVM optimal handling of kmap ptes.
> >>
> >
> >
> > The problem is this code:
> >
> > static int scan_active_list(struct zone_struct * zone, int age,
> > struct list_head * list)
> > {
> > struct list_head *page_lru , *next;
> > struct page * page;
> > int over_rsslimit;
> >
> > /* Take the lock while messing with the list... */
> > lru_lock(zone);
> > list_for_each_safe(page_lru, next, list) {
> > page = list_entry(page_lru, struct page, lru);
> > pte_chain_lock(page);
> > if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
> > age_page_up_nolock(page, age);
> > pte_chain_unlock(page);
> > }
> > lru_unlock(zone);
> > return 0;
> > }
>
> > If the pages in the list are in the same order as in the ptes (which is
> > very likely), then we have the following access pattern
Yes it is likely.
> > - set up kmap to point at pte
> > - test_and_clear_bit(pte)
> > - kunmap
> >
> > From kvm's point of view this looks like
> >
> > - several accesses to set up the kmap
Hmm, the kmap establishment takes a single guest operation in the
fixmap area. That's a single write to the pte, to write a pte_t 8/4
byte large region (PAE/non-PAE). The same pte_t is then cleared and
flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
I count 1 write here so far.
> > - if these accesses trigger flooding, we will have to tear down the
> > shadow for this page, only to set it up again soon
So the shadow mapping the fixmap area would be torn down by the
flooding.
Or is the shadow corresponding to the real user pte pointed by the
fixmap, that is unshadowed by the flooding, or both/all?
> > - an access to the pte (emulated)
Here I count the second write and this isn't done on the fixmap area
like the first write above, but this is a write to the real user pte,
pointed by the fixmap. So if this is emulated it means the shadow of
the user pte pointing to the real data page is still active.
> > - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> > emulations. The pte is worthless anyway since the accessed bit is clear
> > (so we can't set up a shadow pte for it)
> > - this bug was fixed
You mean the accessed bit on fixmap pte used by kmap? Or the user pte
pointed by the fixmap pte?
> > - an access to tear down the kmap
Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
matters).
> > [btw, am I reading this right? the entire list is scanned each time?
If the list parameter isn't a local LIST_HEAD on the stack but the
global one it's a full scan each time. I guess it's the global list
looking at the new code at the top that has a kswapd_scan_limit
sysctl.
> > if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> > which would take at least a second no matter what we do. VMware can
> > probably special-case kmaps, but we can't]
Perhaps they've got a list per age bucket or similar, but still I doubt
this works well on the host either... I guess the virtualization overhead
is exacerbating the inefficiency. Perhaps killall -STOP kscand is a good
enough fix ;). This seems to only push the age up; to be functional the
age has to go down, and I guess the go-down is done by other threads, so
stopping kscand may not hurt.
I think what we should aim for is to quickly reach this condition:
1) always keep the fixmap/kmap pte_t shadowed and emulate the
kmap/kunmap access, so the test_and_clear_young done on the user pte
doesn't require re-establishing the spte representing the fixmap
virtual address. If we don't emulate fixmap accesses we'll have to
re-establish the spte during the write to the user pte, and
tear it down again during kunmap_atomic. So there's not much doubt
fixmap access emulation is worth it.
2) get rid of the user pte shadow mapping pointing to the user data so
the test_and_clear of the young bitflag on the user pte will not be
emulated and it'll run at full CPU speed through the shadow pte
mapping corresponding to the fixmap virtual address
The kscand pattern is the same as running mprotect on a 32bit 2.6
kernel, so it sounds worth optimizing for, even if kscand may be
unfixable without killall -STOP kscand or VM fixes in the guest.
However I'm not sure about point 2 in light of mprotect. With
mprotect the guest virtual addresses mapped by the guest user ptes
will be used. It's not like kscand, which may write forever to the user
ptes without ever using the guest virtual addresses that they're
mapping. So we'd better be sure that by unshadowing and optimizing for
kscand we're not hurting mprotect or other pte-mangling operations in
2.6 that will likely keep accessing the guest virtual addresses mapped
by the user ptes previously modified.
Hope this makes some sense; I'm not sure I understand this completely.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 17:04 ` Andrea Arcangeli
@ 2008-05-28 17:24 ` David S. Ahern
2008-05-29 10:01 ` Avi Kivity
1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 17:24 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Avi Kivity, kvm
Yes, I've tried changing kscand_work_percent (values of 50 and 30).
Basically it makes kscand wake more often (i.e., MIN_AGING_INTERVAL
declines in proportion) but do less work each trip through the lists.
I have not seen a noticeable change in guest behavior.
david
Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
>> This is the code in the RHEL3.8 kernel:
>>
>> static int scan_active_list(struct zone_struct * zone, int age,
>> struct list_head * list, int count)
>> {
>> struct list_head *page_lru , *next;
>> struct page * page;
>> int over_rsslimit;
>>
>> count = count * kscand_work_percent / 100;
>> /* Take the lock while messing with the list... */
>> lru_lock(zone);
>> while (count-- > 0 && !list_empty(list)) {
>> page = list_entry(list->prev, struct page, lru);
>> pte_chain_lock(page);
>> if (page_referenced(page, &over_rsslimit)
>> && !over_rsslimit
>> && check_mapping_inuse(page))
>> age_page_up_nolock(page, age);
>> else {
>> list_del(&page->lru);
>> list_add(&page->lru, list);
>> }
>> pte_chain_unlock(page);
>> }
>> lru_unlock(zone);
>> return 0;
>> }
>>
>> My previous email shows examples of the number of pages in the list and
>> the scanning that happens.
>
> This code looks better than the one below, as a limit was introduced
> and the whole list isn't scanned anymore, if you decrease
> kscand_work_percent (I assume it's a sysctl even if it's missing the
> sysctl_ prefix) to say 1, you can limit damages. Did you try it?
>
>> Avi Kivity wrote:
>>> Andrea Arcangeli wrote:
>>>> So I never found a relation to the symptom reported of VM kernel
>>>> threads going weird, with KVM optimal handling of kmap ptes.
>>>>
>>>
>>> The problem is this code:
>>>
>>> static int scan_active_list(struct zone_struct * zone, int age,
>>> struct list_head * list)
>>> {
>>> struct list_head *page_lru , *next;
>>> struct page * page;
>>> int over_rsslimit;
>>>
>>> /* Take the lock while messing with the list... */
>>> lru_lock(zone);
>>> list_for_each_safe(page_lru, next, list) {
>>> page = list_entry(page_lru, struct page, lru);
>>> pte_chain_lock(page);
>>> if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>>> age_page_up_nolock(page, age);
>>> pte_chain_unlock(page);
>>> }
>>> lru_unlock(zone);
>>> return 0;
>>> }
>>> If the pages in the list are in the same order as in the ptes (which is
>>> very likely), then we have the following access pattern
>
> Yes it is likely.
>
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
>
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, to write a pte_t 8/4
> byte large region (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
>
> I count 1 write here so far.
>
>>> - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
>
> So the shadow mapping the fixmap area would be tear down by the
> flooding.
>
> Or is the shadow corresponding to the real user pte pointed by the
> fixmap, that is unshadowed by the flooding, or both/all?
>
>>> - an access to the pte (emulated)
>
> Here I count the second write and this isn't done on the fixmap area
> like the first write above, but this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
>
>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)
>>> - this bug was fixed
>
> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
> pointed by the fixmap pte?
>
>>> - an access to tear down the kmap
>
> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
> matters).
>
>>> [btw, am I reading this right? the entire list is scanned each time?
>
> If the list parameter isn't a local LIST_HEAD on the stack but the
> global one it's a full scan each time. I guess it's the global list
> looking at the new code at the top that has a kswapd_scan_limit
> sysctl.
>
>>> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
>>> which would take at least a second no matter what we do. VMware can
>>> probably special-case kmaps, but we can't]
>
> Perhaps they have a per-age-bucket list or similar, but I still doubt
> this works well on the host either... I guess the virtualization overhead
> is exacerbating the inefficiency. Perhaps killall -STOP kscand is a good
> enough fix ;). This seems to only push the age up; to be functional the
> age has to go down, and I guess the going down is done by other threads,
> so stopping kscand may not hurt.
>
> I think what we should aim for is to quickly reach this condition:
>
> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
> kmap/kunmap access so the test_and_clear_young done on the user pte
> doesn't require to re-establish the spte representing the fixmap
> virtual address. If we don't emulate fixmap we'll have to
> re-establish the spte during the write to the user pte, and
> tear it down again during kunmap_atomic. So there's not much doubt
> fixmap access emulation is worth it.
>
> 2) get rid of the user pte shadow mapping pointing to the user data so
> the test_and_clear of the young bitflag on the user pte will not be
> emulated and it'll run at full CPU speed through the shadow pte
> mapping corresponding to the fixmap virtual address
>
> kscand pattern is the same as running mprotect on a 32bit 2.6
> kernel so it sounds worth optimizing for it, even if kscand may be
> unfixable without killall -STOP kscand or VM fixes to guest.
>
> However I'm not sure about point 2 in light of mprotect. With
> mprotect the guest virtual addresses mapped by the guest user ptes
> will be used. It's not like kscand, which may write forever to the user
> ptes without ever using the guest virtual addresses that they're
> mapping. So we'd better be sure that by unshadowing and optimizing for
> kscand we're not hurting mprotect or other pte-mangling operations in
> 2.6 that will likely keep accessing the guest virtual addresses mapped
> by the previously modified user ptes.
>
> Hope this makes some sense; I'm not sure I understand this completely.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 14:35 ` Avi Kivity
@ 2008-05-28 19:49 ` David S. Ahern
2008-05-29 6:37 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-28 19:49 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
I have a clone of the kvm repository, but evidently I'm not running the
right magic to see the changes in the per-page-pte-tracking branch. I
ran the following:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
git branch per-page-pte-tracking
[dsa@daahern-lx kvm]$ git branch
master
* per-page-pte-tracking
But arch/x86/kvm/mmu.c does not show the changes for the
per-page-pte-history.patch.
What am I not doing correctly here?
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> Weird. Could it be something about the hosts?
>>
>> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
>> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
>>
>> I'll rebuild kvm-69 with your latest patch and try the test programs
>> again.
>>
>
> I've pushed it into kvm.git, branch name per-page-pte-tracking.
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 19:49 ` David S. Ahern
@ 2008-05-29 6:37 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 6:37 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
> I have a clone of the kvm repository, but evidently I'm not running the
> right magic to see the changes in the per-page-pte-tracking branch. I
> ran the following:
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
> git branch per-page-pte-tracking
>
> [dsa@daahern-lx kvm]$ git branch
> master
> * per-page-pte-tracking
>
> But arch/x86/kvm/mmu.c does not show the changes for the
> per-page-pte-history.patch.
>
> What am I not doing correctly here?
>
>
'git branch' creates a new branch. Try the following
git fetch origin
git checkout origin/per-page-pte-tracking
If that doesn't work (old git) try
git fetch git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
per-page-pte-tracking:refs/heads/per-page-pte-tracking
git checkout per-page-pte-tracking
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 17:04 ` Andrea Arcangeli
2008-05-28 17:24 ` David S. Ahern
@ 2008-05-29 10:01 ` Avi Kivity
2008-05-29 14:27 ` Andrea Arcangeli
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 10:01 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David S. Ahern, kvm
Andrea Arcangeli wrote:
>
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
>>>
>
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, writing an 8/4 byte
> (PAE/non-PAE) pte_t region. The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
>
> I count 1 write here so far.
>
>
No, two:
static inline void set_pte(pte_t *ptep, pte_t pte)
{
        ptep->pte_high = pte.pte_high;
        smp_wmb();
        ptep->pte_low = pte.pte_low;
}
>>> - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
>>>
>
> So the shadow mapping the fixmap area would be torn down by the
> flooding.
>
Before we started patching this, yes.
> Or is the shadow corresponding to the real user pte pointed by the
> fixmap, that is unshadowed by the flooding, or both/all?
>
>
After we started patching this, no, but with per-page-pte-history, yes
(correctly).
>>> - an access to the pte (emulated)
>>>
>
> Here I count the second write and this isn't done on the fixmap area
> like the first write above, but this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
>
Right. But if we are scanning a page table linearly, it should be
unshadowed.
>
>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)
>>> - this bug was fixed
>>>
>
> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
> pointed by the fixmap pte?
>
The user pte. After guest code runs test_and_clear_bit(accessed_bit,
ptep), we can't shadow that pte (all shadowed ptes must have the
accessed bit set in the corresponding guest pte, similar to how a tlb
entry can only exist if the accessed bit is set).
>
>>> - an access to tear down the kmap
>>>
>
> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
> matters).
>
Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.
> I think what we should aim for is to quickly reach this condition:
>
> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
> kmap/kunmap access so the test_and_clear_young done on the user pte
> doesn't require to re-establish the spte representing the fixmap
> virtual address. If we don't emulate fixmap we'll have to
> re-establish the spte during the write to the user pte, and
> tear it down again during kunmap_atomic. So there's not much doubt
> fixmap access emulation is worth it.
>
That is what is done by current HEAD.
418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible.
Note that there is an alternative: allow the kmap pte to be unshadowed,
and instead emulate the access through that pte (i.e. emulate the btc
instruction). I don't think it's worth it though because it hurts other
users of the fixmap page.
> 2) get rid of the user pte shadow mapping pointing to the user data so
> the test_and_clear of the young bitflag on the user pte will not be
> emulated and it'll run at full CPU speed through the shadow pte
> mapping corresponding to the fixmap virtual address
>
That's what per-page-pte-history is supposed to do. The first few
accesses are emulated, the next will be native.
It's still not full speed as the kmap setup has to be emulated (twice).
One possible optimization is that if we see the first part of the kmap
instantiation, we emulate a few more instructions before returning to
the guest. Xen does this IIRC.
> kscand pattern is the same as running mprotect on a 32bit 2.6
> kernel so it sounds worth optimizing for it, even if kscand may be
> unfixable without killall -STOP kscand or VM fixes to guest.
>
>
I'm no longer sure the access pattern is sequential, since I see
kmap_atomic() will not recreate the pte if its value has not changed
(unless HIGHMEM_DEBUG).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 15:39 ` David S. Ahern
@ 2008-05-29 11:49 ` Avi Kivity
2008-05-29 12:10 ` Avi Kivity
1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 11:49 UTC (permalink / raw)
To: David S. Ahern; +Cc: Andrea Arcangeli, kvm
David S. Ahern wrote:
> I've been instrumenting the guest kernel as well. It's the scanning of
> the active lists that triggers a lot of calls to paging64_prefetch_page,
> and, as you guys know, correlates with the number of direct pages in the
> list. Earlier in this thread I traced the kvm cycles to
> paging64_prefetch_page(). See
>
I optimized this function a bit, hopefully it will relieve some of the
pain. We still need to reduce the number of times it is called.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 15:39 ` David S. Ahern
2008-05-29 11:49 ` Avi Kivity
@ 2008-05-29 12:10 ` Avi Kivity
2008-05-29 13:49 ` David S. Ahern
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 12:10 UTC (permalink / raw)
To: David S. Ahern; +Cc: Andrea Arcangeli, kvm
David S. Ahern wrote:
> I've been instrumenting the guest kernel as well. It's the scanning of
> the active lists that triggers a lot of calls to paging64_prefetch_page,
> and, as you guys know, correlates with the number of direct pages in the
> list. Earlier in this thread I traced the kvm cycles to
> paging64_prefetch_page(). See
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html
>
> In the guest I started capturing scans (kscand() loop) that took longer
> than a jiffie. Here's an example for 1 trip through the active lists,
> both anonymous and cache:
>
> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
> 36234, dj 225
>
>
HZ=512, so half a second.
41K pages in 0.5s -> 80K pages/sec. Considering we have _at_least_ two
emulations per page, this is almost reasonable.
> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3
>
> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
> 84829, dj 848
>
Here we scanned 100K pages in ~2 seconds. 50K pages/sec, not too good.
> I'll pull down the git branch and give it a spin.
>
I've rebased it again to include the prefetch_page optimization.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 12:10 ` Avi Kivity
@ 2008-05-29 13:49 ` David S. Ahern
2008-05-29 14:08 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 13:49 UTC (permalink / raw)
To: Avi Kivity; +Cc: Andrea Arcangeli, kvm
This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's
just the one age bucket and this is just one example pulled randomly
(well after boot). During that time kscand does get scheduled out, but
ultimately guest time is at 100% during the scans.
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> I've been instrumenting the guest kernel as well. It's the scanning of
>> the active lists that triggers a lot of calls to paging64_prefetch_page,
>> and, as you guys know, correlates with the number of direct pages in the
>> list. Earlier in this thread I traced the kvm cycles to
>> paging64_prefetch_page(). See
>>
>> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html
>>
>> In the guest I started capturing scans (kscand() loop) that took longer
>> than a jiffie. Here's an example for 1 trip through the active lists,
>> both anonymous and cache:
>>
>> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
>> 36234, dj 225
>>
>>
>
> HZ=512, so half a second.
>
> 41K pages in 0.5s -> 80K pages/sec. Considering we have _at_least_ two
> emulations per page, this is almost reasonable.
>
>> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct
>> 1249, dj 3
>>
>> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
>> 84829, dj 848
>>
>
> Here we scanned 100K pages in ~2 seconds. 50K pages/sec, not too good.
>
>> I'll pull down the git branch and give it a spin.
>>
>
> I've rebased it again to include the prefetch_page optimization.
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 13:49 ` David S. Ahern
@ 2008-05-29 14:08 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 14:08 UTC (permalink / raw)
To: David S. Ahern; +Cc: Andrea Arcangeli, kvm
David S. Ahern wrote:
> This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's
> just the one age bucket and this is just one example pulled randomly
> (well after boot). During that time kscand does get scheduled out, but
> ultimately guest time is at 100% during the scans.
>
>
Er, yes. Don't know where that CONFIG_HZ=512 came from in the centos
config files:
That's pretty bad, then.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 10:01 ` Avi Kivity
@ 2008-05-29 14:27 ` Andrea Arcangeli
2008-05-29 15:11 ` David S. Ahern
2008-05-29 15:16 ` Avi Kivity
0 siblings, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-29 14:27 UTC (permalink / raw)
To: Avi Kivity; +Cc: David S. Ahern, kvm
On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
> No, two:
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
>         ptep->pte_high = pte.pte_high;
>         smp_wmb();
>         ptep->pte_low = pte.pte_low;
> }
Right, that can be 2 or 1 depending on PAE vs non-PAE; other 2.4
enterprise distros with pte-highmem ship non-PAE kernels by default.
>>>> - if these accesses trigger flooding, we will have to tear down the
>>>> shadow for this page, only to set it up again soon
>>>>
>>
>> So the shadow mapping the fixmap area would be tear down by the
>> flooding.
>>
>
> Before we started patching this, yes.
Ok, so now the one/two writes to the guest fixmap virt address are
emulated and the spte isn't torn down.
>
>> Or is the shadow corresponding to the real user pte pointed by the
>> fixmap, that is unshadowed by the flooding, or both/all?
>>
>>
>
> After we started patching this, no, but with per-page-pte-history, yes
> (correctly).
So with the per-page-pte-history the shadow representing the guest
user pte that is being modified by page_referenced is unshadowed.
>>>> - an access to the pte (emulated)
>>>>
>>
>> Here I count the second write and this isn't done on the fixmap area
>> like the first write above, but this is a write to the real user pte,
>> pointed by the fixmap. So if this is emulated it means the shadow of
>> the user pte pointing to the real data page is still active.
>>
>
> Right. But if we are scanning a page table linearly, it should be
> unshadowed.
I think we're often not scanning page tables linearly with pte_chains,
and yet those should still be unshadowed. mmaps won't always bring
memory in linear order; memory isn't always initialized by memset
or paged in with contiguous virtual accesses.
So while the assumption that following the active list will sometimes
return guest ptes that map contiguous guest virtual addresses is valid,
it only accounts for a small percentage of the active list. It largely
depends on the userland apps. Furthermore, even if the active lru is
initially pointing to linear ptes, the list is then split into age
buckets depending on the access patterns at runtime, and that further
fragments the linearity of the virtual addresses of the kmapped ptes.
BTW, one thing we didn't account for in previous emails is that there
can be more than one guest user pte modified by page_referenced, if
it's not a direct page. And non-direct pages surely won't provide
linear scans; in fact, for non-direct pages the most common case is that
the pte_t will point to the same virtual address but on a different
pgd_t * (and in turn on a different pmd_t).
>>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>>> (so we can't set up a shadow pte for it)
>>>> - this bug was fixed
>>>>
>>
>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
>> pointed by the fixmap pte?
>>
>
> The user pte. After guest code runs test_and_clear_bit(accessed_bit,
> ptep), we can't shadow that pte (all shadowed ptes must have the accessed
> bit set in the corresponding guest pte, similar to how a tlb entry can only
> exist if the accessed bit is set).
Is this a software invariant to ensure that we'll refresh the accessed
bit on the user pte too?
I assume this is needed because otherwise, if we clear the accessed bit
on the shadow pte and we clear it on the user pte, when the shadow is
mapped in the TLB again the accessed bit will be set on the shadow in
hardware, but not on the user pte, because the accessed bit is set on
the spte without a kvm page fault.
So this means kscand, by clearing the accessed bitflag on them, should
automatically unshadow all user ptes pointed by the fixmap pte.
So a second test_and_clear_bit on the same user pte will run through
the fixmap pte established by kmap_atomic without traps.
So this means when the user program runs again, it'll find the user pte
unshadowed and it'll have to re-instantiate the shadow ptes with a kvm
page fault, whose primary objective is marking the user pte
accessed again (to notify the next kscand pass that the data page
pointed by the user pte was used in the meantime).
If I understand correctly, the establishment of the shadow pte
corresponding to the user pte will have to wrprotect the spte
corresponding to the fixmap pte, because we need to intercept
modifications to shadowed guest ptes, and the spte corresponding to the
fixmap guest pte is pointing to a shadowed guest pte once the
program resumes running.
Then when kscand runs again, for the pages that have been faulted in
by the user program, we'll trap the test_and_clear_bit happening
through the readonly spte corresponding to the fixmap guest pte,
unshadow the spte of the guest user pte again, and mark the
spte corresponding to the fixmap pte as read-write again, because
the test_and_clear_bit tells us that we have to unshadow instead of
emulating.
>>>> - an access to tear down the kmap
>>>>
>>
>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>> matters).
>>
>
> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.
2.4, yes. 2.6 will do something similar to CONFIG_HIGHMEM_DEBUG.
2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and
does nothing in kunmap_atomic.
2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic.
>> I think what we should aim for is to quickly reach this condition:
>>
>> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
>> kmap/kunmap access so the test_and_clear_young done on the user pte
>> doesn't require to re-establish the spte representing the fixmap
>> virtual address. If we don't emulate fixmap we'll have to
>> re-establish the spte during the write to the user pte, and
>> tear it down again during kunmap_atomic. So there's not much doubt
>> fixmap access emulation is worth it.
>>
>
> That is what is done by current HEAD.
> 418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible.
Cool!
>
> Note that there is an alternative: allow the kmap pte to be unshadowed, and
> instead emulate the access through that pte (i.e. emulate the btc
> instruction). I don't think it's worth it though because it hurts other
> users of the fixmap page.
>> 2) get rid of the user pte shadow mapping pointing to the user data so
>> the test_and_clear of the young bitflag on the user pte will not be
>> emulated and it'll run at full CPU speed through the shadow pte
>> mapping corresponding to the fixmap virtual address
>>
>
> That's what per-page-pte-history is supposed to do. The first few accesses
> are emulated, the next will be native.
Why not go native immediately when we notice a test_and_clear of
the accessed bit? First, the ptes won't be in contiguous virtual
address order, so if the flooding of the sptes corresponding to the
guest user pte depends on the gpa of the guest user ptes being
contiguous it won't work well. But more importantly we've found a
test_and_clear_bit of the accessed bitflag, so we should unshadow the
user pte that is being marked "old" immediately without need to detect
any flooding.
> It's still not full speed as the kmap setup has to be emulated (twice).
Agreed, the 1/2/3 emulations on writes to the fixmap area during
kmap_atomic (1/2 for non-PAE/PAE and 1 further pte_clear on 2.6 or 2.4
debug-highmem) seem unavoidable.
But the test_and_clear_bit wrprotect fault (when the guest user pte
is shadowed) should just unshadow the guest user pte, mark the spte
representing the fixmap pte as writeable, and return immediately to
guest mode to actually run test_and_clear_bit natively without writing
it through emulation.
Noticing the test_and_clear_bit also requires a bit of instruction
"detection", but once we've detected it from the eip address, we don't
have to write anything to the guest.
But I guess I'm missing something...
> One possible optimization is that if we see the first part of the kmap
> instantiation, we emulate a few more instructions before returning to the
> guest. Xen does this IIRC.
Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
sure if 32bit PAE is that important to do this for. Most 32bit enterprise
kernels I've worked with aren't compiled with PAE; only one, called bigsmp, is.
Also on 2.6, we could get the same benefit by making 2.6 at least as
optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
only after setting it to a new value. Xen can't optimize that write in
kunmap_atomic.
2.6 has debug enabled by default for no good reason. So that would be
the first optimization to do as it saves a few cycles per
kunmap_atomic on host too.
> I'm no longer sure the access pattern is sequential, since I see
> kmap_atomic() will not recreate the pte if its value has not changed
> (unless HIGHMEM_DEBUG).
Hmm, kmap_atomic always writes a new value to the fixmap pte, even if
it was mapping the same user pte as before:
static inline void *kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        if (page < highmem_start_page)
                return page_address(page);

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#endif
        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
        __flush_tlb_one(vaddr);

        return (void*) vaddr;
}
2.6 does too, because it does the debug pte_clear in kunmap_atomic.
In theory even the host could do pte_same() and avoid an invlpg if it
didn't change, but I'm unsure how frequently we remap the same page;
pte loops like mprotect will map the 4k page of ptes and loop over
it once it's mapped at the fixmap virtual address. So frequent
repetitions of remapping the same page with kmap_atomic sound
unlikely.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 14:27 ` Andrea Arcangeli
@ 2008-05-29 15:11 ` David S. Ahern
2008-05-29 15:16 ` Avi Kivity
1 sibling, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 15:11 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Avi Kivity, kvm
Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
>> No, two:
>>
>> static inline void set_pte(pte_t *ptep, pte_t pte)
>> {
>>         ptep->pte_high = pte.pte_high;
>>         smp_wmb();
>>         ptep->pte_low = pte.pte_low;
>> }
>
> Right, that can be 2 or 1 depending on PAE non-PAE, other 2.4
> enterprise distro with pte-highmem ships non-PAE kernels by default.
RHEL3U8 has CONFIG_X86_PAE set.
<snipped>
>>>>> - an access to tear down the kmap
>>>>>
>>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>>> matters).
>>>
>> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.
>
> 2.4, yes. 2.6 will do something similar to CONFIG_HIGHMEM_DEBUG.
>
> 2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and
> does nothing in kunmap_atomic.
>
> 2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic.
CONFIG_DEBUG_HIGHMEM is set.
<snipped>
>> One possible optimization is that if we see the first part of the kmap
>> instantiation, we emulate a few more instructions before returning to the
>> guest. Xen does this IIRC.
>
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is that important to do this for. Most 32bit enterprise
> kernels I've worked with aren't compiled with PAE; only one, called bigsmp, is.
RHEL3 has a hugemem kernel which basically just enables the 4G/4G split.
My guest with the hugemem kernel runs much better than the standard smp
kernel.
If you care to download it the RHEL3U8 kernel source is posted here:
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/os/SRPMS/kernel-2.4.21-47.EL.src.rpm
Red Hat does heavily patch kernels, so they will be dramatically
different from the kernel.org kernel with the same number.
david
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 14:27 ` Andrea Arcangeli
2008-05-29 15:11 ` David S. Ahern
@ 2008-05-29 15:16 ` Avi Kivity
2008-05-30 13:12 ` Andrea Arcangeli
1 sibling, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-29 15:16 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David S. Ahern, kvm
Andrea Arcangeli wrote:
>>> Here I count the second write and this isn't done on the fixmap area
>>> like the first write above, but this is a write to the real user pte,
>>> pointed by the fixmap. So if this is emulated it means the shadow of
>>> the user pte pointing to the real data page is still active.
>>>
>>>
>> Right. But if we are scanning a page table linearly, it should be
>> unshadowed.
>>
>
> I think we're often not scanning page tables linearly with pte_chains,
> and yet those should still be unshadowed. mmaps won't always bring
> memory in linear order; memory isn't always initialized by memset
> or paged in with contiguous virtual accesses.
>
>
I guess we aren't scanning the page table linearly, since with the
linear-scan test case I can't reproduce the problem.
> So while the assumption that following the active list will sometimes
> return guest ptes that map contiguous guest virtual addresses is valid,
> it only accounts for a small percentage of the active list. It largely
> depends on the userland apps. Furthermore, even if the active lru is
> initially pointing to linear ptes, the list is then split into age
> buckets depending on the access patterns at runtime, and that further
> fragments the linearity of the virtual addresses of the kmapped ptes.
>
> BTW, one thing we didn't account for in previous emails is that there
> can be more than one guest user pte modified by page_referenced, if
> it's not a direct page. And non-direct pages surely won't provide
> linear scans; in fact, for non-direct pages the most common case is that
> the pte_t will point to the same virtual address but on a different
> pgd_t * (and in turn on a different pmd_t).
>
>
Since the pte tracking is per-page, it won't be affected by shared pages.
>>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
>>> pointed by the fixmap pte?
>>>
>>>
>> The user pte. After guest code runs test_and_clear_bit(accessed_bit,
>> ptep), we can't shadow that pte (all shadowed ptes must have the accessed
>> bit set in the corresponding guest pte, similar to how a tlb entry can only
>> exist if the accessed bit is set).
>>
>
> Is this a software invariant to ensure that we'll refresh the accessed
> bit on the user pte too?
>
>
Yes. We need a fault in order to set the guest accessed bit.
> So this means kscand, by clearing the accessed bitflag on them, should
> automatically unshadow all user ptes pointed by the fixmap pte.
>
> So a second test_and_clear_bit on the same user pte will run through
> the fixmap pte established by kmap_atomic without traps.
>
> So this means when the user program runs again, it'll find the user pte
> unshadowed and it'll have to re-instantiate the shadow ptes with a kvm
> page fault, whose primary objective is marking the user pte
> accessed again (to notify the next kscand pass that the data page
> pointed by the user pte was used in the meantime).
>
> If I understand correctly, the establishment of the shadow pte
> corresponding to the user pte will have to wrprotect the spte
> corresponding to the fixmap pte, because we need to intercept
> modifications to shadowed guest ptes, and the spte corresponding to the
> fixmap guest pte is pointing to a shadowed guest pte once the
> program resumes running.
>
> Then when kscand runs again, for the pages that have been faulted in
> by the user program, we'll trap the test_and_clear_bit happening
> through the readonly spte corresponding to the fixmap guest pte,
> unshadow the spte of the guest user pte again, and mark the
> spte corresponding to the fixmap pte as read-write again, because
> the test_and_clear_bit tells us that we have to unshadow instead of
> emulating.
>
Yes.
>>> 2) get rid of the user pte shadow mapping pointing to the user data so
>>> the test_and_clear of the young bitflag on the user pte will not be
>>> emulated and it'll run at full CPU speed through the shadow pte
>>> mapping corresponding to the fixmap virtual address
>>>
>>>
>> That's what per-page-pte-history is supposed to do. The first few accesses
>> are emulated, the next will be native.
>>
>
> Why not go native immediately when we notice a test_and_clear of
> the accessed bit? First, the ptes won't be in contiguous virtual
> address order, so if the flooding of the sptes corresponding to the
> guest user pte depends on the gpa of the guest user ptes being
> contiguous it won't work well. But more importantly we've found a
> test_and_clear_bit of the accessed bitflag, so we should unshadow the
> user pte that is being marked "old" immediately without need to detect
> any flooding.
>
Unshadowing a page is expensive, both in immediate cost, and in future
cost of reshadowing the page and taking faults. It's worthwhile to be
sure the guest really doesn't want it as a page table.
>> It's still not full speed as the kmap setup has to be emulated (twice).
>>
>
> Agreed, the 1/2/3 emulations on writes to the fixmap area during
> kmap_atomic (1/2 for non-PAE/PAE and 1 further pte_clear on 2.6 or 2.4
> debug-highmem) seems unavoidable.
>
> But the test_and_clear_bit wrprotect fault (when the guest user pte
> is shadowed) should just unshadow the guest user pte, mark the spte
> representing the fixmap pte as writeable, and return immediately to
> guest mode to actually run test_and_clear_bit natively without writing
> it through emulation.
>
> Noticing the test_and_clear_bit also requires a bit of instruction
> "detection", but once we detected it from the eip address, we don't
> have to write anything to the guest.
>
> But I guess I'm missing something...
>
>
If the pages are not scanned linearly, then unshadowing may not help.
Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.
Well, then after 4000 scans we ought to have unshadowed everything. So
I guess per-page-pte-history is broken, can't explain it otherwise.
>> One possible optimization is that if we see the first part of the kmap
>> instantiation, we emulate a few more instructions before returning to the
>> guest. Xen does this IIRC.
>>
>
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is that important to do this. Most 32bit enterprise
> kernels I worked with aren't compiled with PAE; only the one called bigsmp is.
>
>
Well, it seems RHEL 3.8 smp is PAE.
> Also on 2.6, we could get the same benefit by making 2.6 at least as
> optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
> only after setting it to a new value. Xen can't optimize that write in
> kunmap_atomic.
>
> 2.6 has debug enabled by default for no good reason. So that would be
> the first optimization to do as it saves a few cycles per
> kunmap_atomic on host too.
>
>
Yes, it's probably a small win on native as well.
>> I'm no longer sure the access pattern is sequential, since I see
>> kmap_atomic() will not recreate the pte if its value has not changed
>> (unless HIGHMEM_DEBUG).
>>
>
> Hmm kmap_atomic always writes a new value to the fixmap pte, even if
> it was mapping the same user pte as before.
>
> static inline void *kmap_atomic(struct page *page, enum km_type type)
> {
>         enum fixed_addresses idx;
>         unsigned long vaddr;
>
>         if (page < highmem_start_page)
>                 return page_address(page);
>
>         idx = type + KM_TYPE_NR*smp_processor_id();
>         vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>         if (!pte_none(*(kmap_pte-idx)))
>                 out_of_line_bug();
> #endif
>         set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
>         __flush_tlb_one(vaddr);
>
>         return (void*) vaddr;
> }
>
>
The centos 3.8 sources have
static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#else
        /*
         * Performance optimization - do not flush if the new
         * pte is the same as the old one:
         */
        if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
                return (void *) vaddr;
#endif
        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
        __flush_tlb_one(vaddr);

        return (void*) vaddr;
}
(linux-2.4.21-47.EL)
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-28 10:51 ` Avi Kivity
2008-05-28 14:13 ` David S. Ahern
@ 2008-05-29 16:42 ` David S. Ahern
2008-05-31 8:16 ` Avi Kivity
1 sibling, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-05-29 16:42 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
[-- Attachment #1: Type: text/plain, Size: 5838 bytes --]
Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still see large system time hiccups in the
>> guests due to kscand in the guest scanning its active lists. I do see
>> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
>> completeness I also tried a history of 2, but it performed worse than 3
>> which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>> memuser 768M 120 5 300
>> memuser 384M 300 10 600
>>
>> Together these instances take up a 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>
>
> I haven't been able to reproduce this:
>
>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ?
>> 00:00:26 [kscand]
>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0
>> 00:00:00 grep -E memuser|kscand
>
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant. This is a 2GB guest running with my
>> patch ported to kvm.git HEAD. Guest has 2G of memory.
>
I'm running on the per-page-pte-tracking branch, and I am still seeing it.
I doubt you want to sit and watch the screen for an hour, so install sysstat if it isn't already installed, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours, and then run 'sar -u'. You'll see something like this:
10:12:11 AM LINUX RESTART
10:13:03 AM CPU %user %nice %system %iowait %idle
10:14:01 AM all 0.08 0.00 2.08 0.35 97.49
10:15:03 AM all 0.05 0.00 0.79 0.04 99.12
10:15:59 AM all 0.15 0.00 1.52 0.06 98.27
10:17:01 AM all 0.04 0.00 0.69 0.04 99.23
10:17:59 AM all 0.01 0.00 0.39 0.00 99.60
10:18:59 AM all 0.00 0.00 0.12 0.02 99.87
10:20:02 AM all 0.18 0.00 14.62 0.09 85.10
10:21:01 AM all 0.71 0.00 26.35 0.01 72.94
10:22:02 AM all 0.67 0.00 10.61 0.00 88.72
10:22:59 AM all 0.14 0.00 1.80 0.00 98.06
10:24:03 AM all 0.13 0.00 0.50 0.00 99.37
10:24:59 AM all 0.09 0.00 11.46 0.00 88.45
10:26:03 AM all 0.16 0.00 0.69 0.03 99.12
10:26:59 AM all 0.14 0.00 10.01 0.02 89.83
10:28:03 AM all 0.57 0.00 2.20 0.03 97.20
Average: all 0.21 0.00 5.55 0.05 94.20
Every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running, the guest %system time is <1%.
The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. An intern here added a timestamp to the kvm_stat output for me, which helps to directly correlate guest/host data.
I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one which happened after 15 minutes of uptime:
active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212
The kvm_stat data for this time period is attached due to line lengths.
Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list:
if (need_active_cache_scan(zone)) {
        for (age = MAX_AGE-1; age >= 0; age--) {
                scan_active_list(zone, age,
                                 &zone->active_cache_list[age],
                                 zone->active_anon_count[age]);
                                 ^^^^^^^^^^^^^^^^^
                if (current->need_resched)
                        schedule();
        }
}
When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here:
active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3
count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest.
david
[-- Attachment #2: kvm_stat.kscand --]
[-- Type: text/plain, Size: 2650 bytes --]
kvm-69/kvm_stat -f 'mmu*|pf*' -l:
mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc mmu_shado pf_fixed pf_guest
182 18 18 0 5664 5682 0 18 5720 21
211 59 59 0 7040 7105 0 59 7348 99
81 0 48 0 45861 45909 0 48 45910 1
209 683 814 0 178527 179405 0 814 181410 9
67 111 320 0 175602 175922 0 320 177202 0
28 0 29 0 181365 181394 0 29 181394 0
7 0 22 0 181834 181856 0 22 181855 0
35 0 14 0 180129 180143 0 14 180144 0
7 0 10 0 179141 179151 0 10 179150 0
35 0 3 0 181359 181361 0 3 181362 0
7 0 4 0 181565 181570 0 4 181570 0
21 0 3 0 181435 181437 0 3 181437 0
21 0 4 0 181281 181286 0 4 181285 0
21 0 3 0 179444 179447 0 3 179448 0
91 0 61 0 179841 179902 0 61 179902 0
7 0 247 0 176628 176875 0 247 176874 0
313 478 133 1 100486 100604 0 133 126690 80
162 21 18 0 6361 6379 0 18 6584 5
294 40 23 21 9144 9188 0 25 9544 45
143 5 1 0 5026 5027 0 1 5502 1
The above corresponds to the following from the guest:
active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 15:16 ` Avi Kivity
@ 2008-05-30 13:12 ` Andrea Arcangeli
2008-05-31 7:39 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2008-05-30 13:12 UTC (permalink / raw)
To: Avi Kivity; +Cc: David S. Ahern, kvm
On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
> Yes. We need a fault in order to set the guest accessed bit.
So what I'm missing now is how the spte corresponding to the user pte
that is under test_and_clear to clear the accessed bit, will not be
zapped immediately. If we don't zap it immediately, how do we set the
accessed bit again on the user pte, when the user program returned
running and used that shadow pte to access the program data after the
kscand pass?
Or am I missing something?
> Unshadowing a page is expensive, both in immediate cost, and in future cost
> of reshadowing the page and taking faults. It's worthwhile to be sure the
> guest really doesn't want it as a page table.
Ok that makes sense, but can we defer the unshadowing while still
emulating the accessed bit correctly on the user pte?
> If the pages are not scanned linearly, then unshadowing may not help.
It should help the second time kscand runs, for the user ptes that
aren't shadowed anymore, the second pass won't require any emulation
> for test_and_clear_bit because the spte of the fixmap area will be
read-write. The bug that passes the anonymous pages number instead of
the cache number will lead to many more test_and_clear than needed,
and not all user ptes may be used in between two different kscand passes.
> Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.
There are likely 1500 ptes in highmem. (ram isn't the most important factor)
> Well, then after 4000 scans we ought to have unshadowed everything. So I
> guess per-page-pte-history is broken, can't explain it otherwise.
Yes, we should have unshadowed all user ptes after 4000 scans and then
the test_and_clear shouldn't require any more emulation, there will be
only 3 emulations for each kmap_atomic/kunmap_atomic.
> static inline void *__kmap_atomic(struct page *page, enum km_type type)
> {
>         enum fixed_addresses idx;
>         unsigned long vaddr;
>
>         idx = type + KM_TYPE_NR*smp_processor_id();
>         vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>         if (!pte_none(*(kmap_pte-idx)))
>                 out_of_line_bug();
> #else
>         /*
>          * Performance optimization - do not flush if the new
>          * pte is the same as the old one:
>          */
>         if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
>                 return (void *) vaddr;
> #endif
>         set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
>         __flush_tlb_one(vaddr);
>
>         return (void*) vaddr;
> }
It's weird they optimized this if they enabled
CONFIG_HIGHMEM_DEBUG. Anyway it doesn't make a whole lot of difference
as it's an unlikely condition.
> (linux-2.4.21-47.EL)
Downloaded it now.
I think it should be clear by now that we're trying to be
bug-compatible with the host here, and optimizing for 2.6 kmaps.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-30 13:12 ` Andrea Arcangeli
@ 2008-05-31 7:39 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-05-31 7:39 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David S. Ahern, kvm
Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
>
>> Yes. We need a fault in order to set the guest accessed bit.
>>
>
> So what I'm missing now is how the spte corresponding to the user pte
> that is under test_and_clear to clear the accessed bit, will not be
> zapped immediately. If we don't zap it immediately, how do we set the
> accessed bit again on the user pte, when the user program returned
> running and used that shadow pte to access the program data after the
> kscand pass?
>
>
The spte is zapped unconditionally in kvm_mmu_pte_write(), and not
re-established in mmu_pte_write_new_pte() due to the missing accessed bit.
The question is whether to tear down the shadow page it is contained in,
or not.
> Or am I missing something?
>
>
>> Unshadowing a page is expensive, both in immediate cost, and in future cost
>> of reshadowing the page and taking faults. It's worthwhile to be sure the
>> guest really doesn't want it as a page table.
>>
>
> Ok that makes sense, but can we defer the unshadowing while still
> emulating the accessed bit correctly on the user pte?
>
>
We do, unless there's a bad bug somewhere.
>> If the pages are not scanned linearly, then unshadowing may not help.
>>
>
> It should help the second time kscand runs, for the user ptes that
> aren't shadowed anymore, the second pass won't require any emulation
> for test_and_clear_bit because the spte of the fixmap area will be
> read-write. The bug that passes the anonymous pages number instead of
> the cache number will lead to many more test_and_clear than needed,
> and not all user ptes may be used in between two different kscand passes.
>
>
We still need 3 emulations per pte to set the fixmap entry. Unshadowing
saves one emulation on the pte itself.
>> Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.
>>
>
> There are likely 1500 ptes in highmem. (ram isn't the most important factor)
>
>
I use 'pte' in the Intel manual sense (page table entry), not the Linux
sense (page table).
I mentioned these numbers to see the worst case behavior.
Non-highmem:
- with unshadow: O(500) accesses to unshadow the page tables, then
native speed
- without unshadow: O(250000) accesses to modify the ptes
Highmem:
- with unshadow: O(250000) accesses to update the fixmap entry
- without unshadow: O(250000) accesses to update the fixmap entry and to
modify the ptes
>> Well, then after 4000 scans we ought to have unshadowed everything. So I
>> guess per-page-pte-history is broken, can't explain it otherwise.
>>
>
> Yes, we should have unshadowed all user ptes after 4000 scans and then
> the test_and_clear shouldn't require any more emulation, there will be
> only 3 emulations for each kmap_atomic/kunmap_atomic.
>
>
So we save 25% (three emulations per scanned pte instead of four). It's still bad even if everything is working correctly.
>
> I think it should be clear by now that we're trying to be
> bug-compatible with the host here, and optimizing for 2.6 kmaps.
>
Don't understand.
I'm guessing esx gets its good performance by special-casing something.
For example, they can keep the fixmap page never shadowed, always
emulate accesses through the fixmap page, and recompile instructions
that go through fixmap to issue a hypercall.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-29 16:42 ` David S. Ahern
@ 2008-05-31 8:16 ` Avi Kivity
2008-06-02 16:42 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-05-31 8:16 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
>> I haven't been able to reproduce this:
>>
>>
>>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ?
>>> 00:00:26 [kscand]
>>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0
>>> 00:00:21 ./memuser 768M 120 5 300
>>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0
>>> 00:00:10 ./memuser 384M 300 10 600
>>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0
>>> 00:00:00 grep -E memuser|kscand
>>>
>> The workload has been running for about half an hour, and kswapd cpu
>> usage doesn't seem significant. This is a 2GB guest running with my
>> patch ported to kvm.git HEAD. Guest has 2G of memory.
>>
>>
>
> I'm running on the per-page-pte-tracking branch, and I am still seeing it.
>
> I doubt you want to sit and watch the screen for an hour, so install sysstat if not already, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours and then run 'sar -u'. You'll see something like this:
>
> 10:12:11 AM LINUX RESTART
>
> 10:13:03 AM CPU %user %nice %system %iowait %idle
> 10:14:01 AM all 0.08 0.00 2.08 0.35 97.49
> 10:15:03 AM all 0.05 0.00 0.79 0.04 99.12
> 10:15:59 AM all 0.15 0.00 1.52 0.06 98.27
> 10:17:01 AM all 0.04 0.00 0.69 0.04 99.23
> 10:17:59 AM all 0.01 0.00 0.39 0.00 99.60
> 10:18:59 AM all 0.00 0.00 0.12 0.02 99.87
> 10:20:02 AM all 0.18 0.00 14.62 0.09 85.10
> 10:21:01 AM all 0.71 0.00 26.35 0.01 72.94
> 10:22:02 AM all 0.67 0.00 10.61 0.00 88.72
> 10:22:59 AM all 0.14 0.00 1.80 0.00 98.06
> 10:24:03 AM all 0.13 0.00 0.50 0.00 99.37
> 10:24:59 AM all 0.09 0.00 11.46 0.00 88.45
> 10:26:03 AM all 0.16 0.00 0.69 0.03 99.12
> 10:26:59 AM all 0.14 0.00 10.01 0.02 89.83
> 10:28:03 AM all 0.57 0.00 2.20 0.03 97.20
> Average: all 0.21 0.00 5.55 0.05 94.20
>
>
> every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running the guest %system time is <1%. The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. An intern here added a timestamp to the kvm_stat output for me which helps to directly correlate guest/host data.
>
>
> I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one which happened after 15 minutes of uptime:
>
> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
>
> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
>
> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212
>
>
We touched 90,000 ptes in 12 seconds. That's 8,000 ptes per second.
Yet we see 180,000 page faults per second in the trace.
Oh! Only 45K pages were direct, so the other 45K were shared, with
perhaps many ptes. We should count ptes, not pages.
Can you modify page_referenced() to count the numbers of ptes mapped (1
for direct pages, nr_chains for indirect pages) and print the total
deltas in active_anon_scan?
> The kvm_stat data for this time period is attached due to line lengths.
>
>
> Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list:
>
> if (need_active_cache_scan(zone)) {
>         for (age = MAX_AGE-1; age >= 0; age--) {
>                 scan_active_list(zone, age,
>                                  &zone->active_cache_list[age],
>                                  zone->active_anon_count[age]);
>                                  ^^^^^^^^^^^^^^^^^
>                 if (current->need_resched)
>                         schedule();
>         }
> }
>
> When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here:
>
> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3
>
> count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest.
>
For debugging, can you fix it? It certainly has a large impact.
Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the
centos 3.8 update repos.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-05-31 8:16 ` Avi Kivity
@ 2008-06-02 16:42 ` David S. Ahern
2008-06-05 8:37 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-02 16:42 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Avi Kivity wrote:
> David S. Ahern wrote:
>>> I haven't been able to reproduce this:
>>>
>>>
>>>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>>>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ?
>>>> 00:00:26 [kscand]
>>>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0
>>>> 00:00:21 ./memuser 768M 120 5 300
>>>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0
>>>> 00:00:10 ./memuser 384M 300 10 600
>>>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0
>>>> 00:00:00 grep -E memuser|kscand
>>>>
>>> The workload has been running for about half an hour, and kswapd cpu
>>> usage doesn't seem significant. This is a 2GB guest running with my
>>> patch ported to kvm.git HEAD. Guest has 2G of memory.
>>>
>>>
>>
>> I'm running on the per-page-pte-tracking branch, and I am still seeing
>> it.
>> I doubt you want to sit and watch the screen for an hour, so install
>> sysstat if not already, change the sample rate to 1 minute
>> (/etc/cron.d/sysstat), let the server run for a few hours and then run
>> 'sar -u'. You'll see something like this:
>>
>> 10:12:11 AM LINUX RESTART
>>
>> 10:13:03 AM CPU %user %nice %system %iowait %idle
>> 10:14:01 AM all 0.08 0.00 2.08 0.35 97.49
>> 10:15:03 AM all 0.05 0.00 0.79 0.04 99.12
>> 10:15:59 AM all 0.15 0.00 1.52 0.06 98.27
>> 10:17:01 AM all 0.04 0.00 0.69 0.04 99.23
>> 10:17:59 AM all 0.01 0.00 0.39 0.00 99.60
>> 10:18:59 AM all 0.00 0.00 0.12 0.02 99.87
>> 10:20:02 AM all 0.18 0.00 14.62 0.09 85.10
>> 10:21:01 AM all 0.71 0.00 26.35 0.01 72.94
>> 10:22:02 AM all 0.67 0.00 10.61 0.00 88.72
>> 10:22:59 AM all 0.14 0.00 1.80 0.00 98.06
>> 10:24:03 AM all 0.13 0.00 0.50 0.00 99.37
>> 10:24:59 AM all 0.09 0.00 11.46 0.00 88.45
>> 10:26:03 AM all 0.16 0.00 0.69 0.03 99.12
>> 10:26:59 AM all 0.14 0.00 10.01 0.02 89.83
>> 10:28:03 AM all 0.57 0.00 2.20 0.03 97.20
>> Average: all 0.21 0.00 5.55 0.05 94.20
>>
>>
>> every one of those jumps in %system time directly correlates to kscand
>> activity. Without the memuser programs running the guest %system time
>> is <1%. The point of this silly memuser program is just to use high
>> memory -- let it age, then make it active again, sit idle, repeat. If
>> you run kvm_stat with -l in the host you'll see the jump in pte
>> writes/updates. An intern here added a timestamp to the kvm_stat
>> output for me which helps to directly correlate guest/host data.
>>
>>
>> I also ran my real guest on the branch. Performance at boot through
>> the first 15 minutes was much better, but I'm still seeing recurring
>> hits every 5 minutes when kscand kicks in. Here's the data from the
>> guest for the first one which happened after 15 minutes of uptime:
>>
>> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct
>> 24845, dj 59
>>
>> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct
>> 40868, dj 103
>>
>> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct
>> 45805, dj 1212
>>
>>
>
> We touched 90,000 ptes in 12 seconds. That's 8,000 ptes per second.
> Yet we see 180,000 page faults per second in the trace.
>
> Oh! Only 45K pages were direct, so the other 45K were shared, with
> perhaps many ptes. We should count ptes, not pages.
>
> Can you modify page_referenced() to count the numbers of ptes mapped (1
> for direct pages, nr_chains for indirect pages) and print the total
> deltas in active_anon_scan?
>
Here you go. I've shortened the line lengths to get them to squeeze into
80 columns:
anon_scan, all HighMem zone, 187,910 active pages at loop start:
count[12] 21462 -> 230, direct 20469, chains 3479, dj 58
count[11] 1338 -> 1162, direct 227, chains 26144, dj 59
count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117
count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256
count[3] 2738 -> 2207, direct 2680, chains 58, dj 7
count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726
(age number is the index in [])
cache_scan, all HighMem zone, 48,298 active pages at loop start:
count[12] 3642 -> 2982, direct 499, chains 20022, dj 44
count[8] 11254 -> 11187, direct 7189, chains 9854, dj 37
count[4] 15709 -> 15702, direct 5071, chains 9388, dj 31
(with anon_cache_count bug fixed)
If you sum the direct pages and the chains count for each row, and
convert dj into dt (dividing by HZ = 100), you get:
( 20469 + 3479 ) / 0.58 = 41289
( 227 + 26144 ) / 0.59 = 44696
( 26115 + 27617 ) / 1.17 = 45924
( 31508 + 82929 ) / 2.56 = 44701
( 2680 + 58 ) / 0.07 = 39114
( 75024 + 262834 ) / 7.26 = 46536
( 499 + 20022 ) / 0.44 = 46638
( 7189 + 9854 ) / 0.37 = 46062
( 5071 + 9388 ) / 0.31 = 46641
At 4 pte writes per direct page or chain entry, that comes to
~187,000/sec, which is close to the total collected by kvm_stat (data
width shrunk to fit in e-mail; hope this is readable still):
|---------- mmu_ ----------|----- pf_ -----|
cache flood pde_z pte_u pte_w shado fixed guest
267 271 95 21455 21842 285 22840 165
66 88 0 12102 12224 88 12458 0
2042 2133 0 178146 180515 2133 188089 387
1053 1212 0 187067 188485 1212 193011 8
4771 4811 88 185129 190998 4825 207490 448
910 824 7 183066 184050 824 195836 12
707 785 0 176381 177300 785 180350 6
1167 1144 0 189618 191014 1144 195902 10
4238 4193 87 188381 193590 4206 207030 465
1448 1400 7 187786 189509 1400 198688 21
982 971 0 187880 189076 971 198405 2
1165 1208 0 190007 191503 1208 195746 13
1106 1146 0 189144 190550 1146 195143 0
4767 4788 96 185802 191704 4802 206362 477
1388 1431 0 187387 188991 1431 195115 3
584 551 0 77176 77802 551 84829 10
12 7 0 3601 3609 7 13497 4
243 153 91 31085 31333 167 35059 879
21 18 6 3130 3155 18 3827 2
21 4 1 4665 4670 4 6825 9
>> The kvm_stat data for this time period is attached due to line lengths.
>>
>>
>> Also, I forgot to mention this before, but there is a bug in the
>> kscand code in the RHEL3U8 kernel. When it scans the cache list it
>> uses the count from the anonymous list:
>>
>> if (need_active_cache_scan(zone)) {
>>         for (age = MAX_AGE-1; age >= 0; age--) {
>>                 scan_active_list(zone, age,
>>                                  &zone->active_cache_list[age],
>>                                  zone->active_anon_count[age]);
>>                                  ^^^^^^^^^^^^^^^^^
>>                 if (current->need_resched)
>>                         schedule();
>>         }
>> }
>>
>> When the anonymous count is higher it is scanning the cache list
>> repeatedly. An example of that was captured here:
>>
>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon
>> 111967, direct 626, dj 3
>>
>> count anon is active_anon_count[age] which at this moment was 111,967.
>> There were only 222 entries in the cache list, but the count value
>> passed to scan_active_list was 111,967. When the cache list has a lot
>> of direct pages, that causes a larger hit on kvm than needed. That
>> said, I have to live with the bug in the guest.
>>
>
> For debugging, can you fix it? It certainly has a large impact.
>
Yes, I have run a few tests with it fixed to get a ballpark on the
impact. The fix is included in the numbers above.
> Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the
> centos 3.8 update repos.
>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-02 16:42 ` David S. Ahern
@ 2008-06-05 8:37 ` Avi Kivity
2008-06-05 16:20 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-05 8:37 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
>> Oh! Only 45K pages were direct, so the other 45K were shared, with
>> perhaps many ptes. We should count ptes, not pages.
>>
>> Can you modify page_referenced() to count the numbers of ptes mapped (1
>> for direct pages, nr_chains for indirect pages) and print the total
>> deltas in active_anon_scan?
>>
>>
>
> Here you go. I've shortened the line lengths to get them to squeeze into
> 80 columns:
>
> anon_scan, all HighMem zone, 187,910 active pages at loop start:
> count[12] 21462 -> 230, direct 20469, chains 3479, dj 58
> count[11] 1338 -> 1162, direct 227, chains 26144, dj 59
> count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117
> count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256
> count[3] 2738 -> 2207, direct 2680, chains 58, dj 7
> count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726
> (age number is the index in [])
>
>
Where do all those ptes come from? That's 180K pages (most of highmem),
but with 550K ptes.
The memuser workload doesn't use fork(), so there shouldn't be any
indirect ptes.
We might try to unshadow the fixmap page; that means we don't have to do
4 fixmap pte accesses per pte scanned.
The kernel uses two methods for clearing the accessed bit:
For direct pages:
if (pte_young(*pte) && ptep_test_and_clear_young(pte))
referenced++;
(two accesses)
For indirect pages:
if (ptep_test_and_clear_young(pte))
referenced++;
(one access)
These have to be emulated if we don't shadow the fixmap. With your
numbers above, that translates to 700K emulations vs 2200K emulations,
a 3X improvement. I'm not sure it will be sufficient given that we're
reducing a 10-second kscand scan into a 3-second scan.
> If you sum the direct pages and the chains count for each row, convert
> dj into dt (divided by HZ = 100) you get:
>
> ( 20469 + 3479 ) / 0.58 = 41289
> ( 227 + 26144 ) / 0.59 = 44696
> ( 26115 + 27617 ) / 1.17 = 45924
> ( 31508 + 82929 ) / 2.56 = 44701
> ( 2680 + 58 ) / 0.07 = 39114
> ( 75024 + 262834 ) / 7.26 = 46536
> ( 499 + 20022 ) / 0.44 = 46638
> ( 7189 + 9854 ) / 0.37 = 46062
> ( 5071 + 9388 ) / 0.31 = 46641
>
> At 4 pte writes per direct page or chain entry, that comes to
> ~187,000/sec, which is close to the total collected by kvm_stat (data
> width shrunk to fit in e-mail; hope this is readable still):
>
>
> |---------- mmu_ ----------|----- pf_ -----|
> cache flood pde_z pte_u pte_w shado fixed guest
> 267 271 95 21455 21842 285 22840 165
> 66 88 0 12102 12224 88 12458 0
> 2042 2133 0 178146 180515 2133 188089 387
> 1053 1212 0 187067 188485 1212 193011 8
> 4771 4811 88 185129 190998 4825 207490 448
> 910 824 7 183066 184050 824 195836 12
> 707 785 0 176381 177300 785 180350 6
> 1167 1144 0 189618 191014 1144 195902 10
> 4238 4193 87 188381 193590 4206 207030 465
> 1448 1400 7 187786 189509 1400 198688 21
> 982 971 0 187880 189076 971 198405 2
> 1165 1208 0 190007 191503 1208 195746 13
> 1106 1146 0 189144 190550 1146 195143 0
> 4767 4788 96 185802 191704 4802 206362 477
> 1388 1431 0 187387 188991 1431 195115 3
> 584 551 0 77176 77802 551 84829 10
> 12 7 0 3601 3609 7 13497 4
> 243 153 91 31085 31333 167 35059 879
> 21 18 6 3130 3155 18 3827 2
> 21 4 1 4665 4670 4 6825 9
>
>
>>> The kvm_stat data for this time period is attached due to line lengths.
>>>
>>>
>>> Also, I forgot to mention this before, but there is a bug in the
>>> kscand code in the RHEL3U8 kernel. When it scans the cache list it
>>> uses the count from the anonymous list:
>>>
>>> if (need_active_cache_scan(zone)) {
>>> for (age = MAX_AGE-1; age >= 0; age--) {
>>> scan_active_list(zone, age,
>>> &zone->active_cache_list[age],
>>> zone->active_anon_count[age]);
>>> ^^^^^^^^^^^^^^^^^
>>> if (current->need_resched)
>>> schedule();
>>> }
>>> }
>>>
>>> When the anonymous count is higher it is scanning the cache list
>>> repeatedly. An example of that was captured here:
>>>
>>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon
>>> 111967, direct 626, dj 3
>>>
>>> count anon is active_anon_count[age] which at this moment was 111,967.
>>> There were only 222 entries in the cache list, but the count value
>>> passed to scan_active_list was 111,967. When the cache list has a lot
>>> of direct pages, that causes a larger hit on kvm than needed. That
>>> said, I have to live with the bug in the guest.
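A quick sketch of what the mis-passed count costs at the captured moment (assuming scan_active_list walks `count` entries, cycling the list when count exceeds its length):

```python
cache_list_len = 222   # entries actually on the active_cache_list
count_passed = 111967  # active_anon_count[age] passed by mistake

# The 222-entry cache list is walked on the order of 500 times over.
full_passes = count_passed // cache_list_len
print(full_passes)
```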
>>>
>>>
>> For debugging, can you fix it? It certainly has a large impact.
>>
>>
> yes, I have run a few tests with it fixed to get a ballpark on the
> impact. The fix is included in the number above.
>
>
>> Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the
>> centos 3.8 update repos.
>>
>>
It seems to have been fixed there.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-05 8:37 ` Avi Kivity
@ 2008-06-05 16:20 ` David S. Ahern
2008-06-06 16:40 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-05 16:20 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Avi Kivity wrote:
> David S. Ahern wrote:
>>> Oh! Only 45K pages were direct, so the other 45K were shared, with
>>> perhaps many ptes. We should count ptes, not pages.
>>>
>>> Can you modify page_referenced() to count the numbers of ptes mapped (1
>>> for direct pages, nr_chains for indirect pages) and print the total
>>> deltas in active_anon_scan?
>>>
>>>
>>
>> Here you go. I've shortened the line lengths to get them to squeeze into
>> 80 columns:
>>
>> anon_scan, all HighMem zone, 187,910 active pages at loop start:
>> count[12] 21462 -> 230, direct 20469, chains 3479, dj 58
>> count[11] 1338 -> 1162, direct 227, chains 26144, dj 59
>> count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117
>> count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256
>> count[3] 2738 -> 2207, direct 2680, chains 58, dj 7
>> count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726
>> (age number is the index in [])
>>
>>
>
> Where do all those ptes come from? That's 180K pages (most of highmem),
> but with 550K ptes.
>
> The memuser workload doesn't use fork(), so there shouldn't be any
> indirect ptes.
>
> We might try to unshadow the fixmap page; that means we don't have to do
> 4 fixmap pte accesses per pte scanned.
>
> The kernel uses two methods for clearing the accessed bit:
>
> For direct pages:
>
> if (pte_young(*pte) && ptep_test_and_clear_young(pte))
> referenced++;
>
> (two accesses)
>
> For indirect pages:
>
> if (ptep_test_and_clear_young(pte))
> referenced++;
>
> (one access)
>
> Both have to be emulated if we don't shadow the fixmap. With your
> numbers above, that translates to 700K emulations, vs 2200K with the
> fixmap shadowed, a 3X improvement. I'm not sure it will be sufficient
> given that we're reducing a 10-second kscand scan to a 3-second one.
>
A 3-second scan is much better, and compared to where kvm was when I
started this e-mail thread (as high as 30 seconds per scan) it's a
10-fold improvement.
I gave a shot at implementing your suggestion, but evidently I am still
not understanding the shadow implementation. Can you suggest a patch to
try this out?
david
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-05 16:20 ` David S. Ahern
@ 2008-06-06 16:40 ` Avi Kivity
2008-06-19 4:20 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-06 16:40 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
> I gave a shot at implementing your suggestion, but evidently I am still
> not understanding the shadow implementation. Can you suggest a patch to
> try this out?
>
We can have a hacking session at the KVM Forum. Bring a guest on your laptop.
It isn't going to be easy to fix the problem without introducing
a regression somewhere else.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-06 16:40 ` Avi Kivity
@ 2008-06-19 4:20 ` David S. Ahern
2008-06-22 6:34 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-19 4:20 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Avi:
We did not get a chance to do this at the Forum. I'd be interested in
whatever options you have for reducing the scan time further (e.g., try
to get scan time down to 1-2 seconds).
thanks,
david
Avi Kivity wrote:
> David S. Ahern wrote:
>> I gave a shot at implementing your suggestion, but evidently I am still
>> not understanding the shadow implementation. Can you suggest a patch to
>> try this out?
>>
>
> We can have a hacking session at the KVM Forum. Bring a guest on your laptop.
>
> It isn't going to be easy to fix the problem without introducing
> a regression somewhere else.
>
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-19 4:20 ` David S. Ahern
@ 2008-06-22 6:34 ` Avi Kivity
2008-06-23 14:09 ` David S. Ahern
0 siblings, 1 reply; 73+ messages in thread
From: Avi Kivity @ 2008-06-22 6:34 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
[-- Attachment #1: Type: text/plain, Size: 1135 bytes --]
David S. Ahern wrote:
> Avi:
>
> We did not get a chance to do this at the Forum. I'd be interested in
> whatever options you have for reducing the scan time further (e.g., try
> to get scan time down to 1-2 seconds).
>
>
I'm unlikely to get time to do this properly for at least a week, as
this will be quite difficult and I'm already horribly backlogged.
However, there's an alternative option: modifying the guest kernel
source and getting the change upstreamed, as I think RHEL 3 is still
maintained.
The attached patch (untested) should give a 3X boost for kmap_atomics,
by folding the two accesses to set the pte into one, and by dropping the
access that clears the pte. Unfortunately it breaks the ABI, since
external modules will inline the original kmap_atomic() which expects
the pte to be cleared.
This can be worked around by allocating new fixmap slots for kmap_atomic
with the new behavior, and keeping the old slots with the old behavior,
but we should first see if the maintainers are open to performance
optimizations targeting kvm.
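Spelling out the 3X estimate (assuming, as described, that each guest pte write to the shadowed fixmap page is trapped and emulated):

```python
# Trapped pte writes per kmap_atomic()/kunmap_atomic() pair.
set_pte_writes        = 2  # set_pte() takes two accesses (per the mail above)
pte_clear_writes      = 1  # the __kunmap_atomic() access that clears the pte
set_pte_atomic_writes = 1  # set_pte_atomic() folds the update into one access

before = set_pte_writes + pte_clear_writes  # original code path
after  = set_pte_atomic_writes              # patched code path
print(before, after)  # 3 vs 1, hence the ~3X boost
```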
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
[-- Attachment #2: faster-2.4-kmap_atomic.patch --]
[-- Type: text/x-patch, Size: 1057 bytes --]
--- include/asm-i386/atomic_kmap.h.orig 2007-06-12 00:24:29.000000000 +0300
+++ include/asm-i386/atomic_kmap.h 2008-06-22 09:23:26.000000000 +0300
@@ -51,18 +51,13 @@ static inline void *__kmap_atomic(struct
idx = type + KM_TYPE_NR*smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#if HIGHMEM_DEBUG
- if (!pte_none(*(kmap_pte-idx)))
- out_of_line_bug();
-#else
/*
* Performance optimization - do not flush if the new
* pte is the same as the old one:
*/
if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
return (void *) vaddr;
-#endif
- set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+ set_pte_atomic(kmap_pte-idx, mk_pte(page, kmap_prot));
__flush_tlb_one(vaddr);
return (void*) vaddr;
@@ -77,12 +72,6 @@ static inline void __kunmap_atomic(void
if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx))
out_of_line_bug();
- /*
- * force other mappings to Oops if they'll try to access
- * this pte without first remap it
- */
- pte_clear(kmap_pte-idx);
- __flush_tlb_one(vaddr);
#endif
}
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-22 6:34 ` Avi Kivity
@ 2008-06-23 14:09 ` David S. Ahern
2008-06-25 9:51 ` Avi Kivity
0 siblings, 1 reply; 73+ messages in thread
From: David S. Ahern @ 2008-06-23 14:09 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm
Avi Kivity wrote:
> David S. Ahern wrote:
>> Avi:
>>
>> We did not get a chance to do this at the Forum. I'd be interested in
>> whatever options you have for reducing the scan time further (e.g., try
>> to get scan time down to 1-2 seconds).
>>
>>
>
> I'm unlikely to get time to do this properly for at least a week, as
> this will be quite difficult and I'm already horribly backlogged.
> However there's an alternative option, modifying the source and getting
> it upstreamed, as I think RHEL 3 is still maintained.
>
> The attached patch (untested) should give a 3X boost for kmap_atomics,
> by folding the two accesses to set the pte into one, and by dropping the
> access that clears the pte. Unfortunately it breaks the ABI, since
> external modules will inline the original kmap_atomic() which expects
> the pte to be cleared.
>
> This can be worked around by allocating new fixmap slots for kmap_atomic
> with the new behavior, and keeping the old slots with the old behavior,
> but we should first see if the maintainers are open to performance
> optimizations targeting kvm.
>
RHEL3 is in Maintenance mode (for an explanation see
http://www.redhat.com/security/updates/errata/), which means performance
enhancement patches will not make it in.
Also, I'm going to be out of the office for a couple of weeks in July,
so I will need to put this aside until mid-August or so. I'll reevaluate
options then.
david
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
2008-06-23 14:09 ` David S. Ahern
@ 2008-06-25 9:51 ` Avi Kivity
0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-06-25 9:51 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm
David S. Ahern wrote:
>
> RHEL3 is in Maintenance mode (for an explanation see
> http://www.redhat.com/security/updates/errata/) which means performance
> enhancement patches will not make it in.
>
>
Scratch that idea, then.
> Also, I'm going to be out of the office for a couple of weeks in July,
> so I will need to put this aside until mid-August or so. I'll reevaluate
> options then.
>
One thing I'm looking at is implementing out-of-sync shadow pages, as
Xen does, which looks like it will obsolete the entire emulate-vs-flood
heuristic at the cost of making unshadowing a little more expensive and
consuming more memory. See
http://thread.gmane.org/gmane.comp.emulators.xen.devel/52557 (and 58,
59, 60).
--
error compiling committee.c: too many arguments to function
end of thread, other threads:[~2008-06-25 9:51 UTC | newest]
Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-16 0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
2008-04-16 8:46 ` Avi Kivity
2008-04-17 21:12 ` David S. Ahern
2008-04-18 7:57 ` Avi Kivity
2008-04-21 4:31 ` David S. Ahern
2008-04-21 9:19 ` Avi Kivity
2008-04-21 17:07 ` David S. Ahern
2008-04-22 20:23 ` David S. Ahern
2008-04-23 8:04 ` Avi Kivity
2008-04-23 15:23 ` David S. Ahern
2008-04-23 15:53 ` Avi Kivity
2008-04-23 16:39 ` David S. Ahern
2008-04-24 17:25 ` David S. Ahern
2008-04-26 6:43 ` Avi Kivity
2008-04-26 6:20 ` Avi Kivity
2008-04-25 17:33 ` David S. Ahern
2008-04-26 6:45 ` Avi Kivity
2008-04-28 18:15 ` Marcelo Tosatti
2008-04-28 23:45 ` David S. Ahern
2008-04-30 4:18 ` David S. Ahern
2008-04-30 9:55 ` Avi Kivity
2008-04-30 13:39 ` David S. Ahern
2008-04-30 13:49 ` Avi Kivity
2008-05-11 12:32 ` Avi Kivity
2008-05-11 13:36 ` Avi Kivity
2008-05-13 3:49 ` David S. Ahern
2008-05-13 7:25 ` Avi Kivity
2008-05-14 20:35 ` David S. Ahern
2008-05-15 10:53 ` Avi Kivity
2008-05-17 4:31 ` David S. Ahern
[not found] ` <482FCEE1.5040306@qumranet.com>
[not found] ` <4830F90A.1020809@cisco.com>
2008-05-19 4:14 ` [kvm-devel] " David S. Ahern
2008-05-19 14:27 ` Avi Kivity
2008-05-19 16:25 ` David S. Ahern
2008-05-19 17:04 ` Avi Kivity
2008-05-20 14:19 ` Avi Kivity
2008-05-20 14:34 ` Avi Kivity
2008-05-22 22:08 ` David S. Ahern
2008-05-28 10:51 ` Avi Kivity
2008-05-28 14:13 ` David S. Ahern
2008-05-28 14:35 ` Avi Kivity
2008-05-28 19:49 ` David S. Ahern
2008-05-29 6:37 ` Avi Kivity
2008-05-28 14:48 ` Andrea Arcangeli
2008-05-28 14:57 ` Avi Kivity
2008-05-28 15:39 ` David S. Ahern
2008-05-29 11:49 ` Avi Kivity
2008-05-29 12:10 ` Avi Kivity
2008-05-29 13:49 ` David S. Ahern
2008-05-29 14:08 ` Avi Kivity
2008-05-28 15:58 ` Andrea Arcangeli
2008-05-28 15:37 ` Avi Kivity
2008-05-28 15:43 ` David S. Ahern
2008-05-28 17:04 ` Andrea Arcangeli
2008-05-28 17:24 ` David S. Ahern
2008-05-29 10:01 ` Avi Kivity
2008-05-29 14:27 ` Andrea Arcangeli
2008-05-29 15:11 ` David S. Ahern
2008-05-29 15:16 ` Avi Kivity
2008-05-30 13:12 ` Andrea Arcangeli
2008-05-31 7:39 ` Avi Kivity
2008-05-29 16:42 ` David S. Ahern
2008-05-31 8:16 ` Avi Kivity
2008-06-02 16:42 ` David S. Ahern
2008-06-05 8:37 ` Avi Kivity
2008-06-05 16:20 ` David S. Ahern
2008-06-06 16:40 ` Avi Kivity
2008-06-19 4:20 ` David S. Ahern
2008-06-22 6:34 ` Avi Kivity
2008-06-23 14:09 ` David S. Ahern
2008-06-25 9:51 ` Avi Kivity
2008-04-30 13:56 ` Daniel P. Berrange
2008-04-30 14:23 ` David S. Ahern
2008-04-23 8:03 ` Avi Kivity