linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:05 [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side Zhang, Yanmin
@ 2010-04-14  9:20 ` Avi Kivity
  2010-04-14  9:43   ` Sheng Yang
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Avi Kivity @ 2010-04-14  9:20 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Ingo Molnar, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> Here is the new patch of V3 against tip/master of April 13th
> if anyone wants to try it.
>
>    

Thanks for persisting despite the flames.

Can you please separate arch/x86/kvm part of the patch?  That will make 
for easier reviewing, and will need to go through separate trees.

Sheng, did you make any progress with the NMI injection issue?

> +
> diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
> --- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
> +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-14 11:32:45.841278890 +0800
> @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
>   	}
>   }
>
> +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> +
> +static int kvm_is_in_guest(void)
> +{
> +	return percpu_read(current_vcpu) != NULL;
>    

An even more accurate way to determine this is to check whether the 
interrupt frame points back at the 'int $2' instruction.  However we 
plan to switch to a self-IPI method to inject the NMI, and I'm not sure 
whether APIC NMIs are accepted on an instruction boundary or whether 
there's some latency involved.

> +static unsigned long kvm_get_guest_ip(void)
> +{
> +	unsigned long ip = 0;
> +	if (percpu_read(current_vcpu))
> +		ip = kvm_rip_read(percpu_read(current_vcpu));
> +	return ip;
> +}
>    

This may be racy.  kvm_rip_read() accesses a cache in memory; if we're 
in the process of updating the cache, then we may read a stale value.  
See below.

>
>   	trace_kvm_entry(vcpu->vcpu_id);
> +
> +	percpu_write(current_vcpu, vcpu);
>   	kvm_x86_ops->run(vcpu);
> +	percpu_write(current_vcpu, NULL);
>    

If you move this around the 'int $2' instructions you will close the 
race, as a stray NMI won't catch us updating the rip cache.  But that 
depends on whether self-IPI is accepted on the next instruction or not.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:20 ` Avi Kivity
@ 2010-04-14  9:43   ` Sheng Yang
  2010-04-14  9:57     ` Avi Kivity
  2010-04-14 10:43   ` Ingo Molnar
  2010-04-15  1:04   ` Zhang, Yanmin
  2 siblings, 1 reply; 28+ messages in thread
From: Sheng Yang @ 2010-04-14  9:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> > Here is the new patch of V3 against tip/master of April 13th
> > if anyone wants to try it.
> 
> Thanks for persisting despite the flames.
> 
> Can you please separate arch/x86/kvm part of the patch?  That will make
> for easier reviewing, and will need to go through separate trees.
> 
> Sheng, did you make any progress with the NMI injection issue?

Yes, though some other work has interrupted me lately...

The very first version had an issue: according to the SDM, SELF_IPI mode can't 
be used to send an NMI. That's also the reason x2apic provides no way to do this.

But later I found another issue where inspecting inside the guest fails. I think 
it's because the NMI is an asynchronous event; even though it should be 
triggered very quickly, with the current code you can't guarantee that the 
handler runs before the state (current_vcpu) is cleared.

Maybe just extending the "guest state" region would be fine, if the latency is 
stable enough (though I suspect it may be platform dependent). I am working on 
this now. 

-- 
regards
Yang, Sheng

> 
> > +
> > diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
> > --- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
> > +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-14 11:32:45.841278890 +0800
> > @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
> >   	}
> >   }
> >
> > +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> > +
> > +static int kvm_is_in_guest(void)
> > +{
> > +	return percpu_read(current_vcpu) != NULL;
> 
> An even more accurate way to determine this is to check whether the
> interrupt frame points back at the 'int $2' instruction.  However we
> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
> whether APIC NMIs are accepted on an instruction boundary or whether
> there's some latency involved.
> 
> > +static unsigned long kvm_get_guest_ip(void)
> > +{
> > +	unsigned long ip = 0;
> > +	if (percpu_read(current_vcpu))
> > +		ip = kvm_rip_read(percpu_read(current_vcpu));
> > +	return ip;
> > +}
> 
> This may be racy.  kvm_rip_read() accesses a cache in memory; if we're
> in the process of updating the cache, then we may read a stale value.
> See below.
> 
> >   	trace_kvm_entry(vcpu->vcpu_id);
> > +
> > +	percpu_write(current_vcpu, vcpu);
> >   	kvm_x86_ops->run(vcpu);
> > +	percpu_write(current_vcpu, NULL);
> 
> If you move this around the 'int $2' instructions you will close the
> race, as a stray NMI won't catch us updating the rip cache.  But that
> depends on whether self-IPI is accepted on the next instruction or not.
> 


* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:43   ` Sheng Yang
@ 2010-04-14  9:57     ` Avi Kivity
  2010-04-14 10:14       ` Sheng Yang
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-14  9:57 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/14/2010 12:43 PM, Sheng Yang wrote:
> On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
>    
>> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
>>      
>>> Here is the new patch of V3 against tip/master of April 13th
>>> if anyone wants to try it.
>>>        
>> Thanks for persisting despite the flames.
>>
>> Can you please separate arch/x86/kvm part of the patch?  That will make
>> for easier reviewing, and will need to go through separate trees.
>>
>> Sheng, did you make any progress with the NMI injection issue?
>>      
> Yes, though some other works interrupt me lately...
>
> The very first version has issue due to SELF_IPI mode can't be used to send
> NMI according to SDM. That's the reason why x2apic don't have way to do this.
>    

Yes, I see that now.  Looks like others have the same questions...

> But later I found another issue of fail to inspect inside the guest. I think
> it's due to NMI is asynchronous event, though it should be triggered very
> quickly, you can't guarantee that the handler would be triggered before the
> state(current_vcpu) is cleared with current code.
>
> Maybe just extended the "guest state" region would be fine, if the latency is
> stable enough(though I think it maybe platform depended). I am working on this
> now.
>    

I wouldn't like to depend on model specific behaviour.

One option is to read all the information synchronously and store it in 
a per-cpu area with atomic instructions, then queue the NMI.  Another 
option is to have another callback which tells us that the NMI is done, 
and have a busy loop wait until the NMI is delivered.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:57     ` Avi Kivity
@ 2010-04-14 10:14       ` Sheng Yang
  2010-04-14 10:19         ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Sheng Yang @ 2010-04-14 10:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Wednesday 14 April 2010 17:57:50 Avi Kivity wrote:
> On 04/14/2010 12:43 PM, Sheng Yang wrote:
> > On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
> >> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> >>> Here is the new patch of V3 against tip/master of April 13th
> >>> if anyone wants to try it.
> >>
> >> Thanks for persisting despite the flames.
> >>
> >> Can you please separate arch/x86/kvm part of the patch?  That will make
> >> for easier reviewing, and will need to go through separate trees.
> >>
> >> Sheng, did you make any progress with the NMI injection issue?
> >
> > Yes, though some other works interrupt me lately...
> >
> > The very first version has issue due to SELF_IPI mode can't be used to
> > send NMI according to SDM. That's the reason why x2apic don't have way to
> > do this.
> 
> Yes, I see that now.  Looks like others have the same questions...
> 
> > But later I found another issue of fail to inspect inside the guest. I
> > think it's due to NMI is asynchronous event, though it should be
> > triggered very quickly, you can't guarantee that the handler would be
> > triggered before the state(current_vcpu) is cleared with current code.
> >
> > Maybe just extended the "guest state" region would be fine, if the
> > latency is stable enough(though I think it maybe platform depended). I am
> > working on this now.
> 
> I wouldn't like to depend on model specific behaviour.
>
> One option is to read all the information synchronously and store it in
> a per-cpu area with atomic instructions, then queue the NMI.  Another
> option is to have another callback which tells us that the NMI is done,
> and have a busy loop wait until the NMI is delivered.
> 
A callback seems too heavy and may hurt performance badly. Maybe a short 
queue would help, though that is more complex.

But I am still curious how much extending the region would help. I should 
have a result soon...

-- 
regards
Yang, Sheng


* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14 10:14       ` Sheng Yang
@ 2010-04-14 10:19         ` Avi Kivity
  2010-04-14 10:27           ` Sheng Yang
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-14 10:19 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/14/2010 01:14 PM, Sheng Yang wrote:
>
>> I wouldn't like to depend on model specific behaviour.
>>
>> One option is to read all the information synchronously and store it in
>> a per-cpu area with atomic instructions, then queue the NMI.  Another
>> option is to have another callback which tells us that the NMI is done,
>> and have a busy loop wait until the NMI is delivered.
>>
>>      
> Callback seems too heavy, may affect the performance badly. Maybe a short
> queue would help, though this one is more complex.
>    

The patch we're replying to adds callbacks (to read rip, etc.), so it's 
no big deal.  For the queue solution, a queue of size one would probably 
be sufficient even if not guaranteed by the spec.  I don't see how the 
cpu can do another guest entry without delivering the NMI.

> But I am still curious if we extend the region, how much it would help. Would
> get a result soon...
>    

Yes, interesting to see what the latency is.  If it's reasonably short 
(and I expect it will be so), we can do the busy wait solution.

If we have an NMI counter somewhere, we can simply wait until it changes.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14 10:19         ` Avi Kivity
@ 2010-04-14 10:27           ` Sheng Yang
  2010-04-14 10:33             ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Sheng Yang @ 2010-04-14 10:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Wednesday 14 April 2010 18:19:49 Avi Kivity wrote:
> On 04/14/2010 01:14 PM, Sheng Yang wrote:
> >> I wouldn't like to depend on model specific behaviour.
> >>
> >> One option is to read all the information synchronously and store it in
> >> a per-cpu area with atomic instructions, then queue the NMI.  Another
> >> option is to have another callback which tells us that the NMI is done,
> >> and have a busy loop wait until the NMI is delivered.
> >
> > Callback seems too heavy, may affect the performance badly. Maybe a short
> > queue would help, though this one is more complex.
> 
> The patch we're replying to adds callbacks (to read rip, etc.), so it's
> no big deal.  For the queue solution, a queue of size one would probably
> be sufficient even if not guaranteed by the spec.  I don't see how the
> cpu can do another guest entry without delivering the NMI.
> 
> > But I am still curious if we extend the region, how much it would help.
> > Would get a result soon...
> 
> Yes, interesting to see what the latency is.  If it's reasonably short
> (and I expect it will be so), we can do the busy wait solution.
> 
> If we have an NMI counter somewhere, we can simply wait until it changes.
 
Good idea. Of course we have one (at least on x86): there is the per-cpu 
irq_stat.__nmi_count. :)

-- 
regards
Yang, Sheng


* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14 10:27           ` Sheng Yang
@ 2010-04-14 10:33             ` Avi Kivity
  2010-04-14 10:36               ` Sheng Yang
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-14 10:33 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/14/2010 01:27 PM, Sheng Yang wrote:
>
>> Yes, interesting to see what the latency is.  If it's reasonably short
>> (and I expect it will be so), we can do the busy wait solution.
>>
>> If we have an NMI counter somewhere, we can simply wait until it changes.
>>      
>
> Good idea. Of course we have one(at least on x86). There is
> irq_stat.irq__nmi_count for per cpu. :)
>    

Okay, but kvm doesn't want to know about it.  How about a new arch 
function, invoke_nmi_sync(), that will trigger the NMI and wait for it?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14 10:33             ` Avi Kivity
@ 2010-04-14 10:36               ` Sheng Yang
  0 siblings, 0 replies; 28+ messages in thread
From: Sheng Yang @ 2010-04-14 10:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Wednesday 14 April 2010 18:33:37 Avi Kivity wrote:
> On 04/14/2010 01:27 PM, Sheng Yang wrote:
> >> Yes, interesting to see what the latency is.  If it's reasonably short
> >> (and I expect it will be so), we can do the busy wait solution.
> >>
> >> If we have an NMI counter somewhere, we can simply wait until it
> >> changes.
> >
> > Good idea. Of course we have one(at least on x86). There is
> > irq_stat.irq__nmi_count for per cpu. :)
> 
> Okay, but kvm doesn't want to know about it.  How about a new arch
> function, invoke_nmi_sync(), that will trigger the NMI and wait for it?
> 
Sounds reasonable. I'll try it.

-- 
regards
Yang, Sheng


* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:20 ` Avi Kivity
  2010-04-14  9:43   ` Sheng Yang
@ 2010-04-14 10:43   ` Ingo Molnar
  2010-04-14 11:17     ` Avi Kivity
  2010-04-15  1:04   ` Zhang, Yanmin
  2 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2010-04-14 10:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo


* Avi Kivity <avi@redhat.com> wrote:

> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> >Here is the new patch of V3 against tip/master of April 13th
> >if anyone wants to try it.
> >
> 
> Thanks for persisting despite the flames.
> 
> Can you please separate arch/x86/kvm part of the patch?  That will make for 
> easier reviewing, and will need to go through separate trees.

Once it gets into a state that it can be applied could you please create a 
separate, -git based branch for it, so that i can pull it for testing and 
integration with the tools/perf/ bits?

Assuming there are no serious conflicts with pending KVM work.

(or i can do that too)

Thanks,

	Ingo


* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14 10:43   ` Ingo Molnar
@ 2010-04-14 11:17     ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2010-04-14 11:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/14/2010 01:43 PM, Ingo Molnar wrote:
>>
>> Thanks for persisting despite the flames.
>>
>> Can you please separate arch/x86/kvm part of the patch?  That will make for
>> easier reviewing, and will need to go through separate trees.
>>      
> Once it gets into a state that it can be applied could you please create a
> separate, -git based branch for it, so that i can pull it for testing and
> integration with the tools/perf/ bits?
>
>    

Sure.

> Assuming there are no serious conflicts with pending KVM work.
>    

There will be a conflict with the NMI fix (which has to go in first, 
we'll want to backport it), I'll put it on the same branch.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  1:04   ` Zhang, Yanmin
@ 2010-04-15  8:05     ` Avi Kivity
  2010-04-15  8:57       ` Zhang, Yanmin
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-15  8:05 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Ingo Molnar, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/15/2010 04:04 AM, Zhang, Yanmin wrote:
>
>> An even more accurate way to determine this is to check whether the
>> interrupt frame points back at the 'int $2' instruction.  However we
>> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
>> whether APIC NMIs are accepted on an instruction boundary or whether
>> there's some latency involved.
>>      
> Yes. But the frame pointer checking seems a little complicated.
>    

An even bigger disadvantage is that it won't work with Sheng's patch, 
self-NMIs are not synchronous.

>>>    	trace_kvm_entry(vcpu->vcpu_id);
>>> +
>>> +	percpu_write(current_vcpu, vcpu);
>>>    	kvm_x86_ops->run(vcpu);
>>> +	percpu_write(current_vcpu, NULL);
>>>
>>>        
>> If you move this around the 'int $2' instructions you will close the
>> race, as a stray NMI won't catch us updating the rip cache.  But that
>> depends on whether self-IPI is accepted on the next instruction or not.
>>      
> Right. The kernel part has dependency on the self-IPI implementation.
> I will move above percpu_write(current_vcpu, vcpu) (or a new wrapper function)
> just around 'int $2'.
>
>    

Or create a new function to inject the interrupt in x86.c.  That will 
reduce duplication between svm.c and vmx.c.

> Sheng would find a solution on the self-IPI delivery. Let's separate my patch
> and self-IPI as 2 issues as we don't know when the self-IPI delivery would be
> resolved.
>    

Sure.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  8:57       ` Zhang, Yanmin
@ 2010-04-15  9:04         ` Joerg Roedel
  2010-04-15  9:09           ` Avi Kivity
  2010-04-27 19:03         ` [PATCH] Psychovisually-optimized HZ setting (2.6.33.3) Uwaysi Bin Kareem
  1 sibling, 1 reply; 28+ messages in thread
From: Joerg Roedel @ 2010-04-15  9:04 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Avi Kivity, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:

> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
> happens in guest os. In addition, svm_complete_interrupts is called after
> interrupt is enabled.

Yes. The NMI is held pending by the hardware until the STGI instruction
is executed.
And for nested svm the svm_complete_interrupts function needs to be
executed after the nested exit handling. Therefore it is done late on
svm.

	Joerg



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  9:04         ` Joerg Roedel
@ 2010-04-15  9:09           ` Avi Kivity
  2010-04-15  9:44             ` Joerg Roedel
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-15  9:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/15/2010 12:04 PM, Joerg Roedel wrote:
> On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:
>
>    
>> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
>> happens in guest os. In addition, svm_complete_interrupts is called after
>> interrupt is enabled.
>>      
> Yes. The NMI is held pending by the hardware until the STGI instruction
> is executed.
> And for nested svm the svm_complete_interrupts function needs to be
> executed after the nested exit handling. Therefore it is done late on
> svm.
>    

So, we'd need something like the following:

    if (exit == NMI)
        __get_cpu_var(nmi_vcpu) = vcpu;

    stgi();

    if (exit == NMI) {
        while (!nmi_handled())
            cpu_relax();
        __get_cpu_var(nmi_vcpu) = NULL;
    }

and no code sharing between vmx and svm.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  9:09           ` Avi Kivity
@ 2010-04-15  9:44             ` Joerg Roedel
  2010-04-15  9:48               ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Joerg Roedel @ 2010-04-15  9:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Thu, Apr 15, 2010 at 12:09:28PM +0300, Avi Kivity wrote:
> On 04/15/2010 12:04 PM, Joerg Roedel wrote:
>> On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:
>>
>>    
>>> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
>>> happens in guest os. In addition, svm_complete_interrupts is called after
>>> interrupt is enabled.
>>>      
>> Yes. The NMI is held pending by the hardware until the STGI instruction
>> is executed.
>> And for nested svm the svm_complete_interrupts function needs to be
>> executed after the nested exit handling. Therefore it is done late on
>> svm.
>>    
>
> So, we'd need something like the following:
>
>    if (exit == NMI)
>        __get_cpu_var(nmi_vcpu) = vcpu;
>
>    stgi();
>
>    if (exit == NMI) {
>        while (!nmi_handled())
>            cpu_relax();
>        __get_cpu_var(nmi_vcpu) = NULL;
>    }

Hmm, looks a bit complicated to me. The NMI should happen shortly after
the stgi instruction. Interrupts are still disabled so we stay on this
cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?

	Joerg



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  9:44             ` oerg Roedel
@ 2010-04-15  9:48               ` Avi Kivity
  2010-04-15 10:40                 ` Joerg Roedel
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-15  9:48 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/15/2010 12:44 PM, Joerg Roedel wrote:
>
>> So, we'd need something like the following:
>>
>>     if (exit == NMI)
>>         __get_cpu_var(nmi_vcpu) = vcpu;
>>
>>     stgi();
>>
>>     if (exit == NMI) {
>>         while (!nmi_handled())
>>             cpu_relax();
>>         __get_cpu_var(nmi_vcpu) = NULL;
>>     }
>>      
> Hmm, looks a bit complicated to me. The NMI should happen shortly after
> the stgi instruction. Interrupts are still disabled so we stay on this
> cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?
>
>    

That means an NMI that happens outside guest code (for example, in the 
mmu, or during the exit itself) would be counted as if in guest code.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  9:48               ` Avi Kivity
@ 2010-04-15 10:40                 ` Joerg Roedel
  2010-04-15 10:44                   ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Joerg Roedel @ 2010-04-15 10:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Thu, Apr 15, 2010 at 12:48:09PM +0300, Avi Kivity wrote:
> On 04/15/2010 12:44 PM, Joerg Roedel wrote:
>>
>>> So, we'd need something like the following:
>>>
>>>     if (exit == NMI)
>>>         __get_cpu_var(nmi_vcpu) = vcpu;
>>>
>>>     stgi();
>>>
>>>     if (exit == NMI) {
>>>         while (!nmi_handled())
>>>             cpu_relax();
>>>         __get_cpu_var(nmi_vcpu) = NULL;
>>>     }
>>>      
>> Hmm, looks a bit complicated to me. The NMI should happen shortly after
>> the stgi instruction. Interrupts are still disabled so we stay on this
>> cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?
>>
>>    
>
> That means an NMI that happens outside guest code (for example, in the  
> mmu, or during the exit itself) would be counted as if in guest code.

Hmm, true. The same is true for an NMI that happens between VMSAVE and
STGI but that window is smaller. Anyway, I think we don't need the
busy-wait loop. The NMI should be executed at a well defined point and
we set the cpu_var back to NULL after that point.

	Joerg



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15 10:40                 ` Joerg Roedel
@ 2010-04-15 10:44                   ` Avi Kivity
  2010-04-15 14:08                     ` Sheng Yang
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-15 10:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Zhang, Yanmin, Ingo Molnar, Peter Zijlstra, Sheng Yang,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/15/2010 01:40 PM, Joerg Roedel wrote:
>
>> That means an NMI that happens outside guest code (for example, in the
>> mmu, or during the exit itself) would be counted as if in guest code.
>>      
> Hmm, true. The same is true for an NMI that happens between VMSAVE and
> STGI but that window is smaller. Anyway, I think we don't need the
> busy-wait loop. The NMI should be executed at a well defined point and
> we set the cpu_var back to NULL after that point.
>    

The point is not well defined.  Considering there are already at least 
two svm implementations, I don't want to rely on implementation details.

We could tune the position of the loop so that zero iterations are 
executed on the implementations we know about.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest  os statistics from host side
  2010-04-15 10:44                   ` Avi Kivity
@ 2010-04-15 14:08                     ` Sheng Yang
  2010-04-17 18:12                       ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Sheng Yang @ 2010-04-15 14:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Joerg Roedel, Zhang, Yanmin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Thursday 15 April 2010 18:44:15 Avi Kivity wrote:
> On 04/15/2010 01:40 PM, Joerg Roedel wrote:
> >> That means an NMI that happens outside guest code (for example, in the
> >> mmu, or during the exit itself) would be counted as if in guest code.
> >
> > Hmm, true. The same is true for an NMI that happens between VMSAVE and
> > STGI but that window is smaller. Anyway, I think we don't need the
> > busy-wait loop. The NMI should be executed at a well defined point and
> > we set the cpu_var back to NULL after that point.
> 
> The point is not well defined.  Considering there are already at least
> two implementations svm, I don't want to rely on implementation details.

After more investigation, I realized that I had interpreted the SDM wrongly. 
Sorry.

There is *no* risk with the original method of calling "int $2". 

According to the SDM 24.1:

> The following bullets detail when architectural state is and is not updated
> in response to VM exits:
[...]
> - An NMI causes subsequent NMIs to be blocked, but only after the VM exit
> completes.

So the truth is: after an NMI directly causes a VM exit, subsequent NMIs are 
blocked until the next "iret" is encountered. So executing "int $2" in 
vmx_complete_interrupts() is safe; there is no risk of causing a nested NMI. 
And it unblocks subsequent NMIs as well, due to the "iret" it executes.

So there is no need to make a change to avoid a "potential nested NMI".

Sorry for the mistake and the confusion it caused.

-- 
regards
Yang, Sheng

> 
> We could tune the position of the loop so that zero iterations are
> executed on the implementations we know about.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest  os statistics from host side
  2010-04-15 14:08                     ` Sheng Yang
@ 2010-04-17 18:12                       ` Avi Kivity
  2010-04-19  8:25                         ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-17 18:12 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Joerg Roedel, Zhang, Yanmin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/15/2010 05:08 PM, Sheng Yang wrote:
> On Thursday 15 April 2010 18:44:15 Avi Kivity wrote:
>    
>> On 04/15/2010 01:40 PM, Joerg Roedel wrote:
>>      
>>>> That means an NMI that happens outside guest code (for example, in the
>>>> mmu, or during the exit itself) would be counted as if in guest code.
>>>>          
>>> Hmm, true. The same is true for an NMI that happens between VMSAVE and
>>> STGI but that window is smaller. Anyway, I think we don't need the
>>> busy-wait loop. The NMI should be executed at a well defined point and
>>> we set the cpu_var back to NULL after that point.
>>>        
>> The point is not well defined.  Considering there are already at least
>> two implementations svm, I don't want to rely on implementation details.
>>      
> After more investigation, I realized that I had interpreted the SDM wrongly.
> Sorry.
>
> There is *no* risk with the original method of calling "int $2".
>
> According to the SDM 24.1:
>
>    
>> The following bullets detail when architectural state is and is not updated
>> in response to VM exits:
> [...]
>> - An NMI causes subsequent NMIs to be blocked, but only after the VM exit
>> completes.
>
> So the truth is: after an NMI directly causes a VM exit, subsequent NMIs are
> blocked until the next "iret" is encountered. So executing "int $2" in
> vmx_complete_interrupts() is safe; there is no risk of causing a nested NMI.
> And it unblocks subsequent NMIs as well, due to the "iret" it executes.
>
> So there is no need to make a change to avoid a "potential nested NMI".
>    

Let's look at the surrounding text...

>
> The following bullets detail when architectural state is and is not updated in response
> to VM exits:
> •   If an event causes a VM exit directly, it does not update architectural state as it
>     would have if it had not caused the VM exit:
>     — A debug exception does not update DR6, DR7.GD, or IA32_DEBUGCTL.LBR.
>         (Information about the nature of the debug exception is saved in the exit
>         qualification field.)
>     — A page fault does not update CR2. (The linear address causing the page fault
>         is saved in the exit-qualification field.)
>     — An NMI causes subsequent NMIs to be blocked, but only after the VM exit
>         completes.
>     — An external interrupt does not acknowledge the interrupt controller and the
>         interrupt remains pending, unless the “acknowledge interrupt on exit”
>         VM-exit control is 1. In such a case, the interrupt controller is acknowledged
>         and the interrupt is no longer pending.


Everywhere it says state is _not_ updated, so I think what is meant is 
that NMIs are blocked, but only _until_ the VM exit completes.

I think you were right the first time around.  Can you check with your 
architecture team?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest  os statistics from host side
  2010-04-17 18:12                       ` Avi Kivity
@ 2010-04-19  8:25                         ` Avi Kivity
  2010-04-20  3:32                           ` Sheng Yang
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2010-04-19  8:25 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Joerg Roedel, Zhang, Yanmin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/17/2010 09:12 PM, Avi Kivity wrote:
>
> I think you were right the first time around.
>

Re-reading again (esp. the part about treatment of indirect NMI 
vmexits), I think this was wrong, and that the code is correct.  I am 
now thoroughly confused.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest  os statistics from host side
  2010-04-19  8:25                         ` Avi Kivity
@ 2010-04-20  3:32                           ` Sheng Yang
  2010-04-20  9:38                             ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Sheng Yang @ 2010-04-20  3:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Joerg Roedel, Zhang, Yanmin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Monday 19 April 2010 16:25:17 Avi Kivity wrote:
> On 04/17/2010 09:12 PM, Avi Kivity wrote:
> > I think you were right the first time around.
> 
> Re-reading again (esp. the part about treatment of indirect NMI
> vmexits), I think this was wrong, and that the code is correct.  I am
> now thoroughly confused.
> 
My fault...

To my understanding now, "If an event causes a VM exit directly, it does not 
update architectural state as it would have if it had not caused the VM 
exit" means: in the NMI case, an NMI would normally invoke the NMI handler and 
change the "architectural state" to NMI-blocked. In VMX non-root mode, the 
behavior of invoking the NMI handler changes (determined by some VMCS fields), 
but the effect on the "architectural state" does not. So the NMI-blocked state 
would remain the same.

-- 
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest  os statistics from host side
  2010-04-20  3:32                           ` Sheng Yang
@ 2010-04-20  9:38                             ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2010-04-20  9:38 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Joerg Roedel, Zhang, Yanmin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, kvm, Marcelo Tosatti, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On 04/20/2010 06:32 AM, Sheng Yang wrote:
> On Monday 19 April 2010 16:25:17 Avi Kivity wrote:
>    
>> On 04/17/2010 09:12 PM, Avi Kivity wrote:
>>      
>>> I think you were right the first time around.
>>>        
>> Re-reading again (esp. the part about treatment of indirect NMI
>> vmexits), I think this was wrong, and that the code is correct.  I am
>> now thoroughly confused.
>>
>>      
> My fault...
>    

Not at all, it's really confusingly worded.

> To my understanding now, "If an event causes a VM exit directly, it does not
> update architectural state as it would have if it had not caused the VM
> exit" means: in the NMI case, an NMI would normally invoke the NMI handler and
> change the "architectural state" to NMI-blocked. In VMX non-root mode, the
> behavior of invoking the NMI handler changes (determined by some VMCS fields),
> but the effect on the "architectural state" does not. So the NMI-blocked state
> would remain the same.
>    

Agree.  It's confusing because the internal "nmi pending" flag is not 
set, while the "nmi blocking" flag is set.

(on svm both are set, but the NMI is not taken until the vmexit 
completes and the host unmasks NMIs).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)
  2010-04-15  8:57       ` Zhang, Yanmin
  2010-04-15  9:04         ` Joerg Roedel
@ 2010-04-27 19:03         ` Uwaysi Bin Kareem
  2010-04-27 19:51           ` Randy Dunlap
  2010-04-27 21:50           ` Valdis.Kletnieks
  1 sibling, 2 replies; 28+ messages in thread
From: Uwaysi Bin Kareem @ 2010-04-27 19:03 UTC (permalink / raw)
  To: linux-kernel

This is based on the research I did on optimizing my machine for graphics.
I also wrote the following article:
http://www.paradoxuncreated.com/articles/Millennium/Millennium.html
It is a bit outdated now, but I will update it with current information.
The value might be iterated on.

Peace Be With You,
Uwaysi Bin Kareem.


--- Kconfig.hzorig	2010-04-27 13:33:10.302162524 +0200
+++ Kconfig.hz	2010-04-27 20:39:54.736959816 +0200
@@ -45,6 +45,18 @@
  	 1000 Hz is the preferred choice for desktop systems and other
  	 systems requiring fast interactive responses to events.

+	config HZ_3956
+		bool "3956 HZ"
+	help
+	 3956 Hz is nearly the highest timer interrupt rate supported in the  
kernel.
+	 Graphics workstations, and OpenGL applications may benefit from this,
+	 since it gives the lowest framerate-jitter. The exact value 3956 is
+	 psychovisually-optimized, meaning that it aims for a level of jitter,
+	 percieved to be natural, and therefore non-nosiy. It is tuned for a
+	 profile of "where the human senses register the most information".
+	
+	
+
  endchoice

  config HZ
@@ -53,6 +65,7 @@
  	default 250 if HZ_250
  	default 300 if HZ_300
  	default 1000 if HZ_1000
+	default 3956 if HZ_3956

  config SCHED_HRTICK
  	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)
  2010-04-27 19:03         ` [PATCH] Psychovisually-optimized HZ setting (2.6.33.3) Uwaysi Bin Kareem
@ 2010-04-27 19:51           ` Randy Dunlap
  2010-04-27 21:50           ` Valdis.Kletnieks
  1 sibling, 0 replies; 28+ messages in thread
From: Randy Dunlap @ 2010-04-27 19:51 UTC (permalink / raw)
  To: Uwaysi Bin Kareem; +Cc: linux-kernel

On Tue, 27 Apr 2010 21:03:11 +0200 Uwaysi Bin Kareem wrote:

> This is based on the research I did with optimizing my machine for  
> graphics.
> I also wrote the following article:  
> http://www.paradoxuncreated.com/articles/Millennium/Millennium.html
> It is a bit outdated now, but I will update it with current information.
> The value might iterate.

Hi,

What CPU architectures or platforms did you test this on?
Were any other kernel changes needed?


> Peace Be With You,
> Uwaysi Bin Kareem.
> 
> 
> --- Kconfig.hzorig	2010-04-27 13:33:10.302162524 +0200
> +++ Kconfig.hz	2010-04-27 20:39:54.736959816 +0200
> @@ -45,6 +45,18 @@
>   	 1000 Hz is the preferred choice for desktop systems and other
>   	 systems requiring fast interactive responses to events.
> 
> +	config HZ_3956
> +		bool "3956 HZ"
> +	help
> +	 3956 Hz is nearly the highest timer interrupt rate supported in the  
> kernel.
> +	 Graphics workstations, and OpenGL applications may benefit from this,

	drop first comma.

> +	 since it gives the lowest framerate-jitter. The exact value 3956 is
> +	 psychovisually-optimized, meaning that it aims for a level of jitter,
> +	 percieved to be natural, and therefore non-nosiy. It is tuned for a

	perceived                               non-noisy.

> +	 profile of "where the human senses register the most information".
> +	
> +	
> +
>   endchoice
> 
>   config HZ
> @@ -53,6 +65,7 @@
>   	default 250 if HZ_250
>   	default 300 if HZ_300
>   	default 1000 if HZ_1000
> +	default 3956 if HZ_3956
> 
>   config SCHED_HRTICK
>   	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
> 
> --


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)
  2010-04-27 19:03         ` [PATCH] Psychovisually-optimized HZ setting (2.6.33.3) Uwaysi Bin Kareem
  2010-04-27 19:51           ` Randy Dunlap
@ 2010-04-27 21:50           ` Valdis.Kletnieks
  1 sibling, 0 replies; 28+ messages in thread
From: Valdis.Kletnieks @ 2010-04-27 21:50 UTC (permalink / raw)
  To: Uwaysi Bin Kareem; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1234 bytes --]

On Tue, 27 Apr 2010 21:03:11 +0200, Uwaysi Bin Kareem said:

> http://www.paradoxuncreated.com/articles/Millennium/Millennium.html

> +	config HZ_3956
> +		bool "3956 HZ"
> +	help
> +	 3956 Hz is nearly the highest timer interrupt rate supported in the kernel.
> +	 Graphics workstations, and OpenGL applications may benefit from this,
> +	 since it gives the lowest framerate-jitter. The exact value 3956 is
> +	 psychovisually-optimized, meaning that it aims for a level of jitter,

Even after reading your link, it's unclear why 3956 and not 4000. All your link
said was "A granularity below 0.5 milliseconds, seems to suit the human
senses." - anything over 2000 meets that requirement.  Also, if your screen
refresh is sitting at 72 Hz, or a bit under 14ms per refresh, any jitter under
that won't really matter much - it doesn't matter if your next frame is
ready 5ms early or 5.5ms early, you *still* have to wait for the next vertical
blanking interval or suffer tearing.

There's also the case of programs where HZ=300 would *make* the time budget,
but the added 3,656 timer interrupts per second and associated overhead would
cause a missed screen refresh.

I think you need more technical justification for why 3956 is better than 1000.

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
@ 2010-04-14  9:05 Zhang, Yanmin
  2010-04-14  9:20 ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-14  9:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Avi Kivity, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

Here is the new patch of V3 against tip/master of April 13th
if anyone wants to try it.

ChangeLog V3:
	1) Add --guestmount=/dir/to/all/guestos parameter. The admin mounts guest os
	root directories under /dir/to/all/guestos by sshfs. For example, I start
	2 guest os; one's pid is 8888 and the other's is 9999.
	#mkdir ~/guestmount; cd ~/guestmount
	#sshfs -o allow_other,direct_io -p 5551 localhost:/ 8888/
	#sshfs -o allow_other,direct_io -p 5552 localhost:/ 9999/
	#perf kvm --host --guest --guestmount=~/guestmount top

	The old --guestkallsyms and --guestmodules are still supported as default
	guest os symbol parsing.

	2) Add guest os buildid support.
	3) Add sub command 'perf kvm buildid-list'.
	4) Delete sub command 'perf kvm stat', because our current implementation
	doesn't transfer the guest/host requirement to the kernel, and the kernel
	always collects both host and guest statistics. So regular 'perf stat' is ok.
	5) Fix a couple of perf bugs.
	6) We still have no support for the parameter 'any', as current KVM just
	uses the process id to identify a specific guest os instance. Users could
	use parameter -p to collect statistics for a specific guest os instance.

ChangeLog V2:
        1) Based on Avi's suggestion, I moved callback functions
        to generic code area. So the kernel part of the patch is
        clearer.
        2) Add 'perf kvm stat'.


From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>

Based on the discussion in the KVM community, I worked out this patch to let
perf collect guest os statistics from the host side. It was implemented with
the kind help of Ingo, Peter and some other guys. Yang Sheng pointed out a
critical bug and, together with others, provided good suggestions. I really
appreciate their kind help.

The patch adds new sub command kvm to perf.

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  perf kvm buildid-list

The new perf can profile the guest os kernel but not guest os user space; it
can, however, summarize guest os user-space utilization per guest os.

Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

--------------------------------------------------------------------------------------------------------------------------
   PerfTop:   16010 irqs/sec  kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ _______________________

            38770.00 20.4% __ticket_spin_lock        [guest.kernel.kallsyms]
            22560.00 11.9% ftrace_likely_update      [kernel.kallsyms]
             9208.00  4.8% __lock_acquire            [kernel.kallsyms]
             5473.00  2.9% trace_hardirqs_off_caller [kernel.kallsyms]
             5222.00  2.7% copy_user_generic_string  [guest.kernel.kallsyms]
             4450.00  2.3% validate_chain            [kernel.kallsyms]
             4262.00  2.2% trace_hardirqs_on_caller  [kernel.kallsyms]
             4239.00  2.2% do_raw_spin_lock          [kernel.kallsyms]
             3548.00  1.9% do_raw_spin_unlock        [kernel.kallsyms]
             2487.00  1.3% lock_release              [kernel.kallsyms]
             2165.00  1.1% __local_bh_disable        [kernel.kallsyms]
             1905.00  1.0% check_chain_key           [kernel.kallsyms]
             1737.00  0.9% lock_acquire              [kernel.kallsyms]
             1604.00  0.8% tcp_recvmsg               [kernel.kallsyms]
             1524.00  0.8% mark_lock                 [kernel.kallsyms]
             1464.00  0.8% schedule                  [kernel.kallsyms]
             1423.00  0.7% __d_lookup                [guest.kernel.kallsyms]

If you want to show only host data, please omit the parameter --guest.
The headline includes the guest os kernel and userspace percentages.

2) perf kvm record
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules record -f -a sleep 60
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 29.385 MB perf.data.kvm (~1283837 samples) ]

3) perf kvm report
        3.1) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report --sort pid --showcpuutilization>norm.host.guest.report.pid
# Samples: 424719292247
#
# Overhead  sys    us    guest sys    guest us            Command:  Pid
# ........  .....................
#
    50.57%     1.02%     0.00%    39.97%     9.58%  qemu-system-x86: 3587
    49.32%     1.35%     0.01%    35.20%    12.76%  qemu-system-x86: 3347
     0.07%     0.07%     0.00%     0.00%     0.00%             perf: 5217


Some performance guys require perf to show sys/us/guest_sys/guest_us per KVM guest
instance, which is actually just a multi-threaded process. The sub-parameter
--showcpuutilization above does so.

        3.2) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report >norm.host.guest.report
# Samples: 2466991384118
#
# Overhead          Command                                                             Shared Object  Symbol
# ........  ...............  ........................................................................  ......
#
    29.11%  qemu-system-x86  [guest.kernel.kallsyms]                                                   [g] __ticket_spin_lock
     5.88%       tbench_srv  [kernel.kallsyms]                                                         [k] ftrace_likely_update
     5.76%           tbench  [kernel.kallsyms]                                                         [k] ftrace_likely_update
     3.88%  qemu-system-x86                                                                34c3255482  [u] 0x000034c3255482
     1.83%           tbench  [kernel.kallsyms]                                                         [k] __lock_acquire
     1.81%       tbench_srv  [kernel.kallsyms]                                                         [k] __lock_acquire
     1.38%       tbench_srv  [kernel.kallsyms]                                                         [k] trace_hardirqs_off_caller
     1.37%           tbench  [kernel.kallsyms]                                                         [k] trace_hardirqs_off_caller
     1.13%  qemu-system-x86  [guest.kernel.kallsyms]                                                   [g] copy_user_generic_string
     1.04%       tbench_srv  [kernel.kallsyms]                                                         [k] validate_chain
     1.00%           tbench  [kernel.kallsyms]                                                         [k] trace_hardirqs_on_caller
     1.00%       tbench_srv  [kernel.kallsyms]                                                         [k] trace_hardirqs_on_caller
     0.95%           tbench  [kernel.kallsyms]                                                         [k] do_raw_spin_lock


[u] means the sample is in guest os user space. [g] means it is in the guest os kernel. Other info is straightforward.
If it shows a module such as [ext4], that means a guest kernel module, because native host kernel
modules start with a path like /lib/modules/XXX.

4) --guestmount example. I started 2 guest os. Run dbench testing in the 1st and tbench in 2nd guest os.
[root@lkp-ne01 norm]#perf kvm --host --guest --guestmount=/home/ymzhang/guestmount/ top
---------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15972 irqs/sec  kernel: 8.3% us: 0.5% guest kernel:73.9% guest us:17.3% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ __________________________________________________

            32960.00 17.4% __ticket_spin_lock        [guest.kernel.kallsyms]                           
             5464.00  2.9% copy_user_generic_string  [guest.kernel.kallsyms]                           
             4069.00  2.1% copy_user_generic_string  [guest.kernel.kallsyms]                           
             3238.00  1.7% ftrace_likely_update      /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
             2997.00  1.6% __lock_acquire            /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
             2797.00  1.5% tcp_sendmsg               [guest.kernel.kallsyms]                           
             2703.00  1.4% schedule                  [guest.kernel.kallsyms]                           
             2384.00  1.3% __switch_to               [guest.kernel.kallsyms]                           
             2125.00  1.1% tcp_ack                   [guest.kernel.kallsyms]                           
             2045.00  1.1% tcp_recvmsg               [guest.kernel.kallsyms]                           
             1862.00  1.0% tcp_transmit_skb          [guest.kernel.kallsyms]                           
             1734.00  0.9% __ticket_spin_lock        [guest.kernel.kallsyms]                           
             1388.00  0.7% lock_release              /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
             1367.00  0.7% update_curr               [guest.kernel.kallsyms]                           
             1339.00  0.7% fget_light                [guest.kernel.kallsyms]                           
             1332.00  0.7% put_page                  [guest.kernel.kallsyms]                           
             1324.00  0.7% ip_queue_xmit             [guest.kernel.kallsyms]                           
             1296.00  0.7% __d_lookup                [guest.kernel.kallsyms]                           
             1296.00  0.7% tcp_rcv_established       [guest.kernel.kallsyms]                           
             1230.00  0.6% tcp_v4_rcv                [guest.kernel.kallsyms]                           
             1092.00  0.6% dev_queue_xmit            [guest.kernel.kallsyms]                           
             1073.00  0.6% kmem_cache_alloc          [guest.kernel.kallsyms]                           
             1066.00  0.6% ip_rcv                    [guest.kernel.kallsyms]                           
             1049.00  0.6% __inet_lookup_established [guest.kernel.kallsyms]                           
             1048.00  0.6% tcp_write_xmit            [guest.kernel.kallsyms]                           


Below is the patch against tip/master tree of 13th April.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>

---

diff -Nraup linux-2.6_tip0413/arch/x86/include/asm/perf_event.h linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tip0413/arch/x86/include/asm/perf_event.h	2010-04-14 11:11:03.992966568 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h	2010-04-14 11:13:17.261881591 +0800
@@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void)
  */
 #define PERF_EFLAGS_EXACT	(1UL << 3)
 
-#define perf_misc_flags(regs)				\
-({	int misc = 0;					\
-	if (user_mode(regs))				\
-		misc |= PERF_RECORD_MISC_USER;		\
-	else						\
-		misc |= PERF_RECORD_MISC_KERNEL;	\
-	if (regs->flags & PERF_EFLAGS_EXACT)		\
-		misc |= PERF_RECORD_MISC_EXACT;		\
-	misc; })
-
-#define perf_instruction_pointer(regs)	((regs)->ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs)	perf_misc_flags(regs)
 
 #else
 static inline void init_hw_perf_events(void)		{ }
diff -Nraup linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c	2010-04-14 11:11:04.825028810 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c	2010-04-14 17:02:12.198063684 +0800
@@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch
 {
 	struct perf_callchain_entry *entry;
 
+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+		/* TODO: We don't support guest os callchain now */
+		return NULL;
+	}
+
 	if (in_nmi())
 		entry = &__get_cpu_var(pmc_nmi_entry);
 	else
@@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct 
 	regs->cs = __KERNEL_CS;
 	local_save_flags(regs->flags);
 }
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+	unsigned long ip;
+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
+		ip = perf_guest_cbs->get_guest_ip();
+	else
+		ip = instruction_pointer(regs);
+	return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+	int misc = 0;
+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+		misc |= perf_guest_cbs->is_user_mode() ?
+			PERF_RECORD_MISC_GUEST_USER :
+			PERF_RECORD_MISC_GUEST_KERNEL;
+	} else
+		misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+			PERF_RECORD_MISC_KERNEL;
+	if (regs->flags & PERF_EFLAGS_EXACT)
+		misc |= PERF_RECORD_MISC_EXACT;
+
+	return misc;
+}
+
diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-14 11:32:45.841278890 +0800
@@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
 	}
 }
 
+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(current_vcpu) != NULL;
+}
+
+static int kvm_is_user_mode(void)
+{
+	int user_mode = 3;
+	if (percpu_read(current_vcpu))
+		user_mode = kvm_x86_ops->get_cpl(percpu_read(current_vcpu));
+	return user_mode != 0;
+}
+
+static unsigned long kvm_get_guest_ip(void)
+{
+	unsigned long ip = 0;
+	if (percpu_read(current_vcpu))
+		ip = kvm_rip_read(percpu_read(current_vcpu));
+	return ip;
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+};
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
@@ -3801,6 +3830,8 @@ int kvm_arch_init(void *opaque)
 
 	kvm_timer_init();
 
+	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
 	return 0;
 
 out:
@@ -3809,6 +3840,8 @@ out:
 
 void kvm_arch_exit(void)
 {
+	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
@@ -4339,7 +4372,10 @@ static int vcpu_enter_guest(struct kvm_v
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
+
+	percpu_write(current_vcpu, vcpu);
 	kvm_x86_ops->run(vcpu);
+	percpu_write(current_vcpu, NULL);
 
 	/*
 	 * If the guest has used debug registers, at least dr7
diff -Nraup linux-2.6_tip0413/include/linux/perf_event.h linux-2.6_tip0413_perfkvm/include/linux/perf_event.h
--- linux-2.6_tip0413/include/linux/perf_event.h	2010-04-14 11:11:16.922212684 +0800
+++ linux-2.6_tip0413_perfkvm/include/linux/perf_event.h	2010-04-14 11:34:33.478072738 +0800
@@ -288,11 +288,13 @@ struct perf_event_mmap_page {
 	__u64	data_tail;		/* user-space written tail */
 };
 
-#define PERF_RECORD_MISC_CPUMODE_MASK		(3 << 0)
+#define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
 #define PERF_RECORD_MISC_CPUMODE_UNKNOWN	(0 << 0)
 #define PERF_RECORD_MISC_KERNEL			(1 << 0)
 #define PERF_RECORD_MISC_USER			(2 << 0)
 #define PERF_RECORD_MISC_HYPERVISOR		(3 << 0)
+#define PERF_RECORD_MISC_GUEST_KERNEL		(4 << 0)
+#define PERF_RECORD_MISC_GUEST_USER		(5 << 0)
 
 #define PERF_RECORD_MISC_EXACT			(1 << 14)
 /*
@@ -446,6 +448,12 @@ enum perf_callchain_context {
 # include <asm/perf_event.h>
 #endif
 
+struct perf_guest_info_callbacks {
+	int (*is_in_guest) (void);
+	int (*is_user_mode) (void);
+	unsigned long (*get_guest_ip) (void);
+};
+
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include <asm/hw_breakpoint.h>
 #endif
@@ -920,6 +928,12 @@ static inline void perf_event_mmap(struc
 		__perf_event_mmap(vma);
 }
 
+extern struct perf_guest_info_callbacks *perf_guest_cbs;
+extern int perf_register_guest_info_callbacks(
+		struct perf_guest_info_callbacks *);
+extern int perf_unregister_guest_info_callbacks(
+		struct perf_guest_info_callbacks *);
+
 extern void perf_event_comm(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
 
@@ -989,6 +1003,11 @@ perf_sw_event(u32 event_id, u64 nr, int 
 static inline void
 perf_bp_event(struct perf_event *event, void *data)			{ }
 
+static inline int perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks)			{ return 0; }
+static inline int perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks)			{ return 0; }
+
 static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_comm(struct task_struct *tsk)		{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
diff -Nraup linux-2.6_tip0413/kernel/perf_event.c linux-2.6_tip0413_perfkvm/kernel/perf_event.c
--- linux-2.6_tip0413/kernel/perf_event.c	2010-04-14 11:12:04.090770764 +0800
+++ linux-2.6_tip0413_perfkvm/kernel/perf_event.c	2010-04-14 11:13:17.265859229 +0800
@@ -2797,6 +2797,27 @@ void perf_arch_fetch_caller_regs(struct 
 
 
 /*
+ * We assume KVM is the only user of these callbacks.
+ * If another virtualization implementation ever needs them,
+ * this single pointer can be turned into a list.
+ */
+struct perf_guest_info_callbacks *perf_guest_cbs;
+
+int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+	perf_guest_cbs = cbs;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
+
+int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+	perf_guest_cbs = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+/*
  * Output
  */
 static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
@@ -3748,7 +3769,7 @@ void __perf_event_mmap(struct vm_area_st
 		.event_id  = {
 			.header = {
 				.type = PERF_RECORD_MMAP,
-				.misc = 0,
+				.misc = PERF_RECORD_MISC_USER,
 				/* .size */
 			},
 			/* .pid */
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-annotate.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-annotate.c
--- linux-2.6_tip0413/tools/perf/builtin-annotate.c	2010-04-14 11:11:58.474229259 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-annotate.c	2010-04-14 11:13:17.269859901 +0800
@@ -571,7 +571,7 @@ static int __cmd_annotate(void)
 		perf_session__fprintf(session, stdout);
 
 	if (verbose > 2)
-		dsos__fprintf(stdout);
+		dsos__fprintf(&session->kerninfo_root, stdout);
 
 	perf_session__collapse_resort(&session->hists);
 	perf_session__output_resort(&session->hists, session->event_total[0]);
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-buildid-list.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-buildid-list.c
--- linux-2.6_tip0413/tools/perf/builtin-buildid-list.c	2010-04-14 11:11:58.462227060 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-buildid-list.c	2010-04-14 11:13:17.269859901 +0800
@@ -46,7 +46,7 @@ static int __cmd_buildid_list(void)
 	if (with_hits)
 		perf_session__process_events(session, &build_id__mark_dso_hit_ops);
 
-	dsos__fprintf_buildid(stdout, with_hits);
+	dsos__fprintf_buildid(&session->kerninfo_root, stdout, with_hits);
 
 	perf_session__delete(session);
 	return err;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-diff.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-diff.c
--- linux-2.6_tip0413/tools/perf/builtin-diff.c	2010-04-14 11:11:58.426247688 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-diff.c	2010-04-14 11:35:43.245364332 +0800
@@ -33,7 +33,7 @@ static int perf_session__add_hist_entry(
 		return -ENOMEM;
 
 	if (hit)
-		he->count += count;
+		__perf_session__add_count(he, al, count);
 
 	return 0;
 }
@@ -225,6 +225,10 @@ int cmd_diff(int argc, const char **argv
 			input_new = argv[1];
 		} else
 			input_new = argv[0];
+	} else if (symbol_conf.default_guest_vmlinux_name ||
+		   symbol_conf.default_guest_kallsyms) {
+		input_old = "perf.data.host";
+		input_new = "perf.data.guest";
 	}
 
 	symbol_conf.exclude_other = false;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin.h linux-2.6_tip0413_perfkvm/tools/perf/builtin.h
--- linux-2.6_tip0413/tools/perf/builtin.h	2010-04-14 11:11:58.234222967 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin.h	2010-04-14 11:13:17.313858518 +0800
@@ -32,5 +32,6 @@ extern int cmd_version(int argc, const c
 extern int cmd_probe(int argc, const char **argv, const char *prefix);
 extern int cmd_kmem(int argc, const char **argv, const char *prefix);
 extern int cmd_lock(int argc, const char **argv, const char *prefix);
+extern int cmd_kvm(int argc, const char **argv, const char *prefix);
 
 #endif
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-kmem.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-kmem.c
--- linux-2.6_tip0413/tools/perf/builtin-kmem.c	2010-04-14 11:11:58.806260439 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-kmem.c	2010-04-14 11:39:10.199395473 +0800
@@ -351,6 +351,7 @@ static void __print_result(struct rb_roo
 			   int n_lines, int is_caller)
 {
 	struct rb_node *next;
+	struct kernel_info *kerninfo;
 
 	printf("%.102s\n", graph_dotted_line);
 	printf(" %-34s |",  is_caller ? "Callsite": "Alloc Ptr");
@@ -359,6 +360,11 @@ static void __print_result(struct rb_roo
 
 	next = rb_first(root);
 
+	kerninfo = kerninfo__findhost(&session->kerninfo_root);
+	if (!kerninfo) {
+		pr_err("__print_result: couldn't find kernel information\n");
+		return;
+	}
 	while (next && n_lines--) {
 		struct alloc_stat *data = rb_entry(next, struct alloc_stat,
 						   node);
@@ -370,7 +376,7 @@ static void __print_result(struct rb_roo
 		if (is_caller) {
 			addr = data->call_site;
 			if (!raw_ip)
-				sym = map_groups__find_function(&session->kmaps,
+				sym = map_groups__find_function(&kerninfo->kmaps,
 								addr, &map, NULL);
 		} else
 			addr = data->ptr;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-kvm.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-kvm.c
--- linux-2.6_tip0413/tools/perf/builtin-kvm.c	1970-01-01 08:00:00.000000000 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-kvm.c	2010-04-14 11:40:06.551652083 +0800
@@ -0,0 +1,145 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+#include "util/session.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+
+#include "util/debug.h"
+
+#include <sys/prctl.h>
+
+#include <semaphore.h>
+#include <pthread.h>
+#include <math.h>
+
+static char			*file_name = NULL;
+static char			name_buffer[256];
+
+int				perf_host = 1;
+int				perf_guest = 0;
+
+static const char * const kvm_usage[] = {
+	"perf kvm [<options>] {top|record|report|diff}",
+	NULL
+};
+
+static const struct option kvm_options[] = {
+	OPT_STRING('i', "input", &file_name, "file",
+		   "Input file name"),
+	OPT_STRING('o', "output", &file_name, "file",
+		   "Output file name"),
+	OPT_BOOLEAN(0, "guest", &perf_guest,
+		    "Collect guest os data"),
+	OPT_BOOLEAN(0, "host", &perf_host,
+		    "Collect host os data"),
+	OPT_STRING(0, "guestmount", &symbol_conf.guestmount, "directory",
+		   "guest mount directory under which every guest os instance has a subdir"),
+	OPT_STRING(0, "guestvmlinux", &symbol_conf.default_guest_vmlinux_name, "file",
+		   "file saving guest os vmlinux"),
+	OPT_STRING(0, "guestkallsyms", &symbol_conf.default_guest_kallsyms, "file",
+		   "file saving guest os /proc/kallsyms"),
+	OPT_STRING(0, "guestmodules", &symbol_conf.default_guest_modules, "file",
+		   "file saving guest os /proc/modules"),
+	OPT_END()
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+	int rec_argc, i = 0, j;
+	const char **rec_argv;
+
+	rec_argc = argc + 2;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+	rec_argv[i++] = strdup("record");
+	rec_argv[i++] = strdup("-o");
+	rec_argv[i++] = strdup(file_name);
+	for (j = 1; j < argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_record(i, rec_argv, NULL);
+}
+
+static int __cmd_report(int argc, const char **argv)
+{
+	int rec_argc, i = 0, j;
+	const char **rec_argv;
+
+	rec_argc = argc + 2;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+	rec_argv[i++] = strdup("report");
+	rec_argv[i++] = strdup("-i");
+	rec_argv[i++] = strdup(file_name);
+	for (j = 1; j < argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_report(i, rec_argv, NULL);
+}
+
+static int __cmd_buildid_list(int argc, const char **argv)
+{
+	int rec_argc, i = 0, j;
+	const char **rec_argv;
+
+	rec_argc = argc + 2;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+	rec_argv[i++] = strdup("buildid-list");
+	rec_argv[i++] = strdup("-i");
+	rec_argv[i++] = strdup(file_name);
+	for (j = 1; j < argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_buildid_list(i, rec_argv, NULL);
+}
+
+int cmd_kvm(int argc, const char **argv, const char *prefix __used)
+{
+	perf_host = perf_guest = 0;
+
+	argc = parse_options(argc, argv, kvm_options, kvm_usage,
+			PARSE_OPT_STOP_AT_NON_OPTION);
+	if (!argc)
+		usage_with_options(kvm_usage, kvm_options);
+
+	if (!perf_host)
+		perf_guest = 1;
+
+	if (!file_name) {
+		if (perf_host && !perf_guest)
+			sprintf(name_buffer, "perf.data.host");
+		else if (!perf_host && perf_guest)
+			sprintf(name_buffer, "perf.data.guest");
+		else
+			sprintf(name_buffer, "perf.data.kvm");
+		file_name = name_buffer;
+	}
+
+	if (!strncmp(argv[0], "rec", 3)) {
+		return __cmd_record(argc, argv);
+	} else if (!strncmp(argv[0], "rep", 3)) {
+		return __cmd_report(argc, argv);
+	} else if (!strncmp(argv[0], "diff", 4)) {
+		return cmd_diff(argc, argv, NULL);
+	} else if (!strncmp(argv[0], "top", 3)) {
+		return cmd_top(argc, argv, NULL);
+	} else if (!strncmp(argv[0], "buildid-list", 12)) {
+		return __cmd_buildid_list(argc, argv);
+	} else {
+		usage_with_options(kvm_usage, kvm_options);
+	}
+
+	return 0;
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-record.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-record.c
--- linux-2.6_tip0413/tools/perf/builtin-record.c	2010-04-14 11:11:58.806260439 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-record.c	2010-04-14 14:11:09.625252460 +0800
@@ -426,6 +426,52 @@ static void atexit_header(void)
 	perf_header__write(&session->header, output, true);
 }
 
+static void event__synthesize_guest_os(struct kernel_info *kerninfo,
+		void *data __attribute__((unused)))
+{
+	int err;
+	char *guest_kallsyms;
+	char path[PATH_MAX];
+
+	if (is_host_kernel(kerninfo))
+		return;
+
+	/*
+	 * When processing the record & report subcommands for a guest
+	 * kernel, we synthesize the module mmaps before the guest
+	 * kernel mmap and trigger a DSO preload, because default guest
+	 * module symbols are loaded from guest kallsyms rather than
+	 * from /lib/modules/XXX/XXX.  This avoids missing symbols when
+	 * the first sampled address is in a module, not the kernel.
+	 */
+	err = event__synthesize_modules(process_synthesized_event,
+			session,
+			kerninfo);
+	if (err < 0)
+		pr_err("Couldn't record guest kernel [%d]'s reference"
+			" relocation symbol.\n", kerninfo->pid);
+
+	if (is_default_guest(kerninfo))
+		guest_kallsyms = (char *) symbol_conf.default_guest_kallsyms;
+	else {
+		sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+		guest_kallsyms = path;
+	}
+
+	/*
+	 * We use _stext for guest kernel because guest kernel's /proc/kallsyms
+	 * have no _text sometimes.
+	 */
+	err = event__synthesize_kernel_mmap(process_synthesized_event,
+			session, kerninfo, "_text");
+	if (err < 0)
+		err = event__synthesize_kernel_mmap(process_synthesized_event,
+				session, kerninfo, "_stext");
+	if (err < 0)
+		pr_err("Couldn't record guest kernel [%d]'s reference"
+			" relocation symbol.\n", kerninfo->pid);
+}
+
 static int __cmd_record(int argc, const char **argv)
 {
 	int i, counter;
@@ -437,6 +483,7 @@ static int __cmd_record(int argc, const 
 	int child_ready_pipe[2], go_pipe[2];
 	const bool forks = argc > 0;
 	char buf;
+	struct kernel_info *kerninfo;
 
 	page_size = sysconf(_SC_PAGE_SIZE);
 
@@ -572,21 +619,31 @@ static int __cmd_record(int argc, const 
 
 	post_processing_offset = lseek(output, 0, SEEK_CUR);
 
+	kerninfo = kerninfo__findhost(&session->kerninfo_root);
+	if (!kerninfo) {
+		pr_err("Couldn't find native kernel information.\n");
+		return -1;
+	}
+
 	err = event__synthesize_kernel_mmap(process_synthesized_event,
-					    session, "_text");
+			session, kerninfo, "_text");
 	if (err < 0)
 		err = event__synthesize_kernel_mmap(process_synthesized_event,
-						    session, "_stext");
+				session, kerninfo, "_stext");
 	if (err < 0) {
 		pr_err("Couldn't record kernel reference relocation symbol.\n");
 		return err;
 	}
 
-	err = event__synthesize_modules(process_synthesized_event, session);
+	err = event__synthesize_modules(process_synthesized_event,
+				session, kerninfo);
 	if (err < 0) {
 		pr_err("Couldn't record kernel reference relocation symbol.\n");
 		return err;
 	}
+	if (perf_guest)
+		kerninfo__process_allkernels(&session->kerninfo_root,
+			event__synthesize_guest_os, session);
 
 	if (!system_wide && profile_cpu == -1)
 		event__synthesize_thread(target_tid, process_synthesized_event,
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-report.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-report.c
--- linux-2.6_tip0413/tools/perf/builtin-report.c	2010-04-14 11:11:58.462227060 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-report.c	2010-04-14 11:13:17.313858518 +0800
@@ -108,7 +108,7 @@ static int perf_session__add_hist_entry(
 		return -ENOMEM;
 
 	if (hit)
-		he->count += data->period;
+		__perf_session__add_count(he, al, data->period);
 
 	if (symbol_conf.use_callchain) {
 		if (!hit)
@@ -300,7 +300,7 @@ static int __cmd_report(void)
 		perf_session__fprintf(session, stdout);
 
 	if (verbose > 2)
-		dsos__fprintf(stdout);
+		dsos__fprintf(&session->kerninfo_root, stdout);
 
 	next = rb_first(&session->stats_by_id);
 	while (next) {
@@ -437,6 +437,8 @@ static const struct option options[] = {
 		   "sort by key(s): pid, comm, dso, symbol, parent"),
 	OPT_BOOLEAN('P', "full-paths", &symbol_conf.full_paths,
 		    "Don't shorten the pathnames taking into account the cwd"),
+	OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
+		    "Show sample percentage for different cpu modes"),
 	OPT_STRING('p', "parent", &parent_pattern, "regex",
 		   "regex filter to identify parent, see: '--sort parent'"),
 	OPT_BOOLEAN('x', "exclude-other", &symbol_conf.exclude_other,
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-top.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-top.c
--- linux-2.6_tip0413/tools/perf/builtin-top.c	2010-04-14 11:11:58.458238567 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-top.c	2010-04-14 14:28:14.576215651 +0800
@@ -420,8 +420,9 @@ static double sym_weight(const struct sy
 }
 
 static long			samples;
-static long			userspace_samples;
+static long			kernel_samples, us_samples;
 static long			exact_samples;
+static long			guest_us_samples, guest_kernel_samples;
 static const char		CONSOLE_CLEAR[] = "^[[H^[[2J";
 
 static void __list_insert_active_sym(struct sym_entry *syme)
@@ -461,7 +462,10 @@ static void print_sym_table(void)
 	int printed = 0, j;
 	int counter, snap = !display_weighted ? sym_counter : 0;
 	float samples_per_sec = samples/delay_secs;
-	float ksamples_per_sec = (samples-userspace_samples)/delay_secs;
+	float ksamples_per_sec = kernel_samples/delay_secs;
+	float us_samples_per_sec = (us_samples)/delay_secs;
+	float guest_kernel_samples_per_sec = (guest_kernel_samples)/delay_secs;
+	float guest_us_samples_per_sec = (guest_us_samples)/delay_secs;
 	float esamples_percent = (100.0*exact_samples)/samples;
 	float sum_ksamples = 0.0;
 	struct sym_entry *syme, *n;
@@ -470,7 +474,8 @@ static void print_sym_table(void)
 	int sym_width = 0, dso_width = 0, dso_short_width = 0;
 	const int win_width = winsize.ws_col - 1;
 
-	samples = userspace_samples = exact_samples = 0;
+	samples = us_samples = kernel_samples = exact_samples = 0;
+	guest_kernel_samples = guest_us_samples = 0;
 
 	/* Sort the active symbols */
 	pthread_mutex_lock(&active_symbols_lock);
@@ -501,10 +506,21 @@ static void print_sym_table(void)
 	puts(CONSOLE_CLEAR);
 
 	printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
-	printf( "   PerfTop:%8.0f irqs/sec  kernel:%4.1f%%  exact: %4.1f%% [",
-		samples_per_sec,
-		100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
-		esamples_percent);
+	if (!perf_guest) {
+		printf( "   PerfTop:%8.0f irqs/sec  kernel:%4.1f%%  exact: %4.1f%% [",
+			samples_per_sec,
+			100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+			esamples_percent);
+	} else {
+		printf( "   PerfTop:%8.0f irqs/sec  kernel:%4.1f%% us:%4.1f%%"
+			" guest kernel:%4.1f%% guest us:%4.1f%% exact: %4.1f%% [",
+			samples_per_sec,
+			100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+			100.0 - (100.0*((samples_per_sec-us_samples_per_sec)/samples_per_sec)),
+			100.0 - (100.0*((samples_per_sec-guest_kernel_samples_per_sec)/samples_per_sec)),
+			100.0 - (100.0*((samples_per_sec-guest_us_samples_per_sec)/samples_per_sec)),
+			esamples_percent);
+	}
 
 	if (nr_counters == 1 || !display_weighted) {
 		printf("%Ld", (u64)attrs[0].sample_period);
@@ -597,7 +613,6 @@ static void print_sym_table(void)
 
 		syme = rb_entry(nd, struct sym_entry, rb_node);
 		sym = sym_entry__symbol(syme);
-
 		if (++printed > print_entries || (int)syme->snap_count < count_filter)
 			continue;
 
@@ -761,7 +776,7 @@ static int key_mapped(int c)
 	return 0;
 }
 
-static void handle_keypress(int c)
+static void handle_keypress(struct perf_session *session, int c)
 {
 	if (!key_mapped(c)) {
 		struct pollfd stdin_poll = { .fd = 0, .events = POLLIN };
@@ -830,7 +845,7 @@ static void handle_keypress(int c)
 		case 'Q':
 			printf("exiting.\n");
 			if (dump_symtab)
-				dsos__fprintf(stderr);
+				dsos__fprintf(&session->kerninfo_root, stderr);
 			exit(0);
 		case 's':
 			prompt_symbol(&sym_filter_entry, "Enter details symbol");
@@ -866,6 +881,7 @@ static void *display_thread(void *arg __
 	struct pollfd stdin_poll = { .fd = 0, .events = POLLIN };
 	struct termios tc, save;
 	int delay_msecs, c;
+	struct perf_session *session = (struct perf_session *) arg;
 
 	tcgetattr(0, &save);
 	tc = save;
@@ -886,7 +902,7 @@ repeat:
 	c = getc(stdin);
 	tcsetattr(0, TCSAFLUSH, &save);
 
-	handle_keypress(c);
+	handle_keypress(session, c);
 	goto repeat;
 
 	return NULL;
@@ -957,24 +973,46 @@ static void event__process_sample(const 
 	u64 ip = self->ip.ip;
 	struct sym_entry *syme;
 	struct addr_location al;
+	struct kernel_info *kerninfo;
 	u8 origin = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
 
 	++samples;
 
 	switch (origin) {
 	case PERF_RECORD_MISC_USER:
-		++userspace_samples;
+		++us_samples;
 		if (hide_user_symbols)
 			return;
+		kerninfo = kerninfo__findhost(&session->kerninfo_root);
 		break;
 	case PERF_RECORD_MISC_KERNEL:
+		++kernel_samples;
 		if (hide_kernel_symbols)
 			return;
+		kerninfo = kerninfo__findhost(&session->kerninfo_root);
 		break;
+	case PERF_RECORD_MISC_GUEST_KERNEL:
+		++guest_kernel_samples;
+		kerninfo = kerninfo__find(&session->kerninfo_root,
+					  self->ip.pid);
+		break;
+	case PERF_RECORD_MISC_GUEST_USER:
+		++guest_us_samples;
+		/*
+		 * TODO: guest user samples are not processed from the
+		 * host side beyond simple counting.
+		 */
+		return;
 	default:
 		return;
 	}
 
+	if (!kerninfo && perf_guest) {
+		pr_err("Can't find guest [%d]'s kernel information\n",
+			self->ip.pid);
+		return;
+	}
+
 	if (self->header.misc & PERF_RECORD_MISC_EXACT)
 		exact_samples++;
 
@@ -994,7 +1032,7 @@ static void event__process_sample(const 
 		 * --hide-kernel-symbols, even if the user specifies an
 		 * invalid --vmlinux ;-)
 		 */
-		if (al.map == session->vmlinux_maps[MAP__FUNCTION] &&
+		if (al.map == kerninfo->vmlinux_maps[MAP__FUNCTION] &&
 		    RB_EMPTY_ROOT(&al.map->dso->symbols[MAP__FUNCTION])) {
 			pr_err("The %s file can't be used\n",
 			       symbol_conf.vmlinux_name);
@@ -1261,7 +1299,7 @@ static int __cmd_top(void)
 
 	perf_session__mmap_read(session);
 
-	if (pthread_create(&thread, NULL, display_thread, NULL)) {
+	if (pthread_create(&thread, NULL, display_thread, session)) {
 		printf("Could not create display thread.\n");
 		exit(-1);
 	}
diff -Nraup linux-2.6_tip0413/tools/perf/Makefile linux-2.6_tip0413_perfkvm/tools/perf/Makefile
--- linux-2.6_tip0413/tools/perf/Makefile	2010-04-14 11:11:58.802281816 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/Makefile	2010-04-14 11:13:17.313858518 +0800
@@ -472,6 +472,7 @@ BUILTIN_OBJS += $(OUTPUT)builtin-trace.o
 BUILTIN_OBJS += $(OUTPUT)builtin-probe.o
 BUILTIN_OBJS += $(OUTPUT)builtin-kmem.o
 BUILTIN_OBJS += $(OUTPUT)builtin-lock.o
+BUILTIN_OBJS += $(OUTPUT)builtin-kvm.o
 
 PERFLIBS = $(LIB_FILE)
 
diff -Nraup linux-2.6_tip0413/tools/perf/perf.c linux-2.6_tip0413_perfkvm/tools/perf/perf.c
--- linux-2.6_tip0413/tools/perf/perf.c	2010-04-14 11:11:58.478250552 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/perf.c	2010-04-14 11:13:17.313858518 +0800
@@ -307,6 +307,7 @@ static void handle_internal_command(int 
 		{ "probe",	cmd_probe,	0 },
 		{ "kmem",	cmd_kmem,	0 },
 		{ "lock",	cmd_lock,	0 },
+		{ "kvm",	cmd_kvm,	0 },
 	};
 	unsigned int i;
 	static const char ext[] = STRIP_EXTENSION;
diff -Nraup linux-2.6_tip0413/tools/perf/perf.h linux-2.6_tip0413_perfkvm/tools/perf/perf.h
--- linux-2.6_tip0413/tools/perf/perf.h	2010-04-14 11:11:58.810277694 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/perf.h	2010-04-14 11:13:17.313858518 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+extern int perf_host, perf_guest;
+
 #endif
diff -Nraup linux-2.6_tip0413/tools/perf/util/build-id.c linux-2.6_tip0413_perfkvm/tools/perf/util/build-id.c
--- linux-2.6_tip0413/tools/perf/util/build-id.c	2010-04-14 11:11:58.654213263 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/build-id.c	2010-04-14 11:13:17.317861518 +0800
@@ -24,7 +24,7 @@ static int build_id__mark_dso_hit(event_
 	}
 
 	thread__find_addr_map(thread, session, cpumode, MAP__FUNCTION,
-			      event->ip.ip, &al);
+			      event->ip.pid, event->ip.ip, &al);
 
 	if (al.map != NULL)
 		al.map->dso->hit = 1;
diff -Nraup linux-2.6_tip0413/tools/perf/util/event.c linux-2.6_tip0413_perfkvm/tools/perf/util/event.c
--- linux-2.6_tip0413/tools/perf/util/event.c	2010-04-14 11:11:58.662259868 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/event.c	2010-04-14 15:33:50.903104472 +0800
@@ -112,7 +112,11 @@ static int event__synthesize_mmap_events
 		event_t ev = {
 			.header = {
 				.type = PERF_RECORD_MMAP,
-				.misc = 0, /* Just like the kernel, see kernel/perf_event.c __perf_event_mmap */
+				/*
+				 * Just like the kernel, see kernel/perf_event.c
+				 * __perf_event_mmap
+				 */
+				.misc = PERF_RECORD_MISC_USER,
 			 },
 		};
 		int n;
@@ -167,11 +171,23 @@ static int event__synthesize_mmap_events
 }
 
 int event__synthesize_modules(event__handler_t process,
-			      struct perf_session *session)
+			      struct perf_session *session,
+			      struct kernel_info *kerninfo)
 {
 	struct rb_node *nd;
+	struct map_groups *kmaps = &kerninfo->kmaps;
+	u16 misc;
 
-	for (nd = rb_first(&session->kmaps.maps[MAP__FUNCTION]);
+	/*
+	 * kernel uses 0 for user space maps, see kernel/perf_event.c
+	 * __perf_event_mmap
+	 */
+	if (is_host_kernel(kerninfo))
+		misc = PERF_RECORD_MISC_KERNEL;
+	else
+		misc = PERF_RECORD_MISC_GUEST_KERNEL;
+
+	for (nd = rb_first(&kmaps->maps[MAP__FUNCTION]);
 	     nd; nd = rb_next(nd)) {
 		event_t ev;
 		size_t size;
@@ -182,12 +198,13 @@ int event__synthesize_modules(event__han
 
 		size = ALIGN(pos->dso->long_name_len + 1, sizeof(u64));
 		memset(&ev, 0, sizeof(ev));
-		ev.mmap.header.misc = 1; /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
+		ev.mmap.header.misc = misc;
 		ev.mmap.header.type = PERF_RECORD_MMAP;
 		ev.mmap.header.size = (sizeof(ev.mmap) -
 				        (sizeof(ev.mmap.filename) - size));
 		ev.mmap.start = pos->start;
 		ev.mmap.len   = pos->end - pos->start;
+		ev.mmap.pid   = kerninfo->pid;
 
 		memcpy(ev.mmap.filename, pos->dso->long_name,
 		       pos->dso->long_name_len + 1);
@@ -250,13 +267,17 @@ static int find_symbol_cb(void *arg, con
 
 int event__synthesize_kernel_mmap(event__handler_t process,
 				  struct perf_session *session,
+				  struct kernel_info *kerninfo,
 				  const char *symbol_name)
 {
 	size_t size;
+	const char *filename, *mmap_name;
+	char path[PATH_MAX];
+	struct map *map;
+
 	event_t ev = {
 		.header = {
 			.type = PERF_RECORD_MMAP,
-			.misc = 1, /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
 		},
 	};
 	/*
@@ -266,16 +287,38 @@ int event__synthesize_kernel_mmap(event_
 	 */
 	struct process_symbol_args args = { .name = symbol_name, };
 
-	if (kallsyms__parse("/proc/kallsyms", &args, find_symbol_cb) <= 0)
+	if (is_host_kernel(kerninfo)) {
+		/*
+		 * kernel uses PERF_RECORD_MISC_USER for user space maps,
+		 * see kernel/perf_event.c __perf_event_mmap
+		 */
+		ev.header.misc = PERF_RECORD_MISC_KERNEL;
+		mmap_name = "kernel.kallsyms";
+		filename = "/proc/kallsyms";
+	} else {
+		ev.header.misc = PERF_RECORD_MISC_GUEST_KERNEL;
+		mmap_name = "guest.kernel.kallsyms";
+		if (is_default_guest(kerninfo))
+			filename = (char *) symbol_conf.default_guest_kallsyms;
+		else {
+			sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+			filename = path;
+		}
+	}
+
+	if (kallsyms__parse(filename, &args, find_symbol_cb) <= 0)
 		return -ENOENT;
 
+	map = kerninfo->vmlinux_maps[MAP__FUNCTION];
 	size = snprintf(ev.mmap.filename, sizeof(ev.mmap.filename),
-			"[kernel.kallsyms.%s]", symbol_name) + 1;
+			"[%s.%s]", mmap_name, symbol_name) + 1;
 	size = ALIGN(size, sizeof(u64));
-	ev.mmap.header.size = (sizeof(ev.mmap) - (sizeof(ev.mmap.filename) - size));
+	ev.mmap.header.size = (sizeof(ev.mmap) -
+			(sizeof(ev.mmap.filename) - size));
 	ev.mmap.pgoff = args.start;
-	ev.mmap.start = session->vmlinux_maps[MAP__FUNCTION]->start;
-	ev.mmap.len   = session->vmlinux_maps[MAP__FUNCTION]->end - ev.mmap.start ;
+	ev.mmap.start = map->start;
+	ev.mmap.len   = map->end - ev.mmap.start;
+	ev.mmap.pid   = kerninfo->pid;
 
 	return process(&ev, session);
 }
@@ -329,82 +372,134 @@ int event__process_lost(event_t *self, s
 	return 0;
 }
 
-int event__process_mmap(event_t *self, struct perf_session *session)
+static void event_set_kernel_mmap_len(struct map **maps, event_t *self)
 {
-	struct thread *thread;
-	struct map *map;
-
-	dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
-		    self->mmap.pid, self->mmap.tid, self->mmap.start,
-		    self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+	maps[MAP__FUNCTION]->start = self->mmap.start;
+	maps[MAP__FUNCTION]->end   = self->mmap.start + self->mmap.len;
+	/*
+	 * Be a bit paranoid here, some perf.data file came with
+	 * a zero sized synthesized MMAP event for the kernel.
+	 */
+	if (maps[MAP__FUNCTION]->end == 0)
+		maps[MAP__FUNCTION]->end = ~0UL;
+}
 
-	if (self->mmap.pid == 0) {
-		static const char kmmap_prefix[] = "[kernel.kallsyms.";
+static int event__process_kernel_mmap(event_t *self,
+			struct perf_session *session)
+{
+	struct map *map;
+	const char *kmmap_prefix, *short_name;
+	struct kernel_info *kerninfo;
+	enum dso_kernel_type kernel_type;
+
+	kerninfo = kerninfo__findnew(&session->kerninfo_root, self->mmap.pid);
+	if (!kerninfo) {
+		pr_err("Can't find id %d's kerninfo\n", self->mmap.pid);
+		goto out_problem;
+	}
 
-		if (self->mmap.filename[0] == '/') {
-			char short_module_name[1024];
-			char *name = strrchr(self->mmap.filename, '/'), *dot;
-
-			if (name == NULL)
-				goto out_problem;
-
-			++name; /* skip / */
-			dot = strrchr(name, '.');
-			if (dot == NULL)
-				goto out_problem;
-
-			snprintf(short_module_name, sizeof(short_module_name),
-				 "[%.*s]", (int)(dot - name), name);
-			strxfrchar(short_module_name, '-', '_');
-
-			map = perf_session__new_module_map(session,
-							   self->mmap.start,
-							   self->mmap.filename);
-			if (map == NULL)
-				goto out_problem;
-
-			name = strdup(short_module_name);
-			if (name == NULL)
-				goto out_problem;
-
-			map->dso->short_name = name;
-			map->end = map->start + self->mmap.len;
-		} else if (memcmp(self->mmap.filename, kmmap_prefix,
+	if (is_host_kernel(kerninfo)) {
+		kmmap_prefix = "[kernel.kallsyms.";
+		short_name = "[kernel.kallsyms]";
+		kernel_type = DSO_TYPE_KERNEL;
+	} else {
+		kmmap_prefix = "[guest.kernel.kallsyms.";
+		short_name = "[guest.kernel.kallsyms]";
+		kernel_type = DSO_TYPE_GUEST_KERNEL;
+	}
+
+	if (self->mmap.filename[0] == '/') {
+
+		char short_module_name[1024];
+		char *name = strrchr(self->mmap.filename, '/'), *dot;
+
+		if (name == NULL)
+			goto out_problem;
+
+		++name; /* skip / */
+		dot = strrchr(name, '.');
+		if (dot == NULL)
+			goto out_problem;
+
+		snprintf(short_module_name, sizeof(short_module_name),
+				"[%.*s]", (int)(dot - name), name);
+		strxfrchar(short_module_name, '-', '_');
+
+		map = map_groups__new_module(&kerninfo->kmaps,
+				self->mmap.start,
+				self->mmap.filename,
+				kerninfo);
+		if (map == NULL)
+			goto out_problem;
+
+		name = strdup(short_module_name);
+		if (name == NULL)
+			goto out_problem;
+
+		map->dso->short_name = name;
+		map->end = map->start + self->mmap.len;
+	} else if (memcmp(self->mmap.filename, kmmap_prefix,
 				sizeof(kmmap_prefix) - 1) == 0) {
-			const char *symbol_name = (self->mmap.filename +
-						   sizeof(kmmap_prefix) - 1);
+		const char *symbol_name = (self->mmap.filename +
+				sizeof(kmmap_prefix) - 1);
+		/*
+		 * Should be there already, from the build-id table in
+		 * the header.
+		 */
+		struct dso *kernel = __dsos__findnew(&kerninfo->dsos__kernel,
+				short_name);
+		if (kernel == NULL)
+			goto out_problem;
+
+		kernel->kernel = kernel_type;
+		if (__map_groups__create_kernel_maps(&kerninfo->kmaps,
+					kerninfo->vmlinux_maps, kernel) < 0)
+			goto out_problem;
+
+		event_set_kernel_mmap_len(kerninfo->vmlinux_maps, self);
+		perf_session__set_kallsyms_ref_reloc_sym(kerninfo->vmlinux_maps,
+				symbol_name,
+				self->mmap.pgoff);
+		if (is_default_guest(kerninfo)) {
 			/*
-			 * Should be there already, from the build-id table in
-			 * the header.
+			 * preload dso of guest kernel and modules
 			 */
-			struct dso *kernel = __dsos__findnew(&dsos__kernel,
-							     "[kernel.kallsyms]");
-			if (kernel == NULL)
-				goto out_problem;
-
-			kernel->kernel = 1;
-			if (__perf_session__create_kernel_maps(session, kernel) < 0)
-				goto out_problem;
+			dso__load(kernel,
+				kerninfo->vmlinux_maps[MAP__FUNCTION],
+				NULL);
+		}
+	}
+	return 0;
+out_problem:
+	return -1;
+}
 
-			session->vmlinux_maps[MAP__FUNCTION]->start = self->mmap.start;
-			session->vmlinux_maps[MAP__FUNCTION]->end   = self->mmap.start + self->mmap.len;
-			/*
-			 * Be a bit paranoid here, some perf.data file came with
-			 * a zero sized synthesized MMAP event for the kernel.
-			 */
-			if (session->vmlinux_maps[MAP__FUNCTION]->end == 0)
-				session->vmlinux_maps[MAP__FUNCTION]->end = ~0UL;
+int event__process_mmap(event_t *self, struct perf_session *session)
+{
+	struct kernel_info *kerninfo;
+	struct thread *thread;
+	struct map *map;
+	u8 cpumode = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	int ret = 0;
 
-			perf_session__set_kallsyms_ref_reloc_sym(session, symbol_name,
-								 self->mmap.pgoff);
-		}
+	dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
+			self->mmap.pid, self->mmap.tid, self->mmap.start,
+			self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+
+	if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL ||
+	    cpumode == PERF_RECORD_MISC_KERNEL) {
+		ret = event__process_kernel_mmap(self, session);
+		if (ret < 0)
+			goto out_problem;
 		return 0;
 	}
 
 	thread = perf_session__findnew(session, self->mmap.pid);
-	map = map__new(self->mmap.start, self->mmap.len, self->mmap.pgoff,
-		       self->mmap.pid, self->mmap.filename, MAP__FUNCTION,
-		       session->cwd, session->cwdlen);
+	kerninfo = kerninfo__findhost(&session->kerninfo_root);
+	map = map__new(&kerninfo->dsos__user, self->mmap.start,
+			self->mmap.len, self->mmap.pgoff,
+			self->mmap.pid, self->mmap.filename,
+			MAP__FUNCTION, session->cwd, session->cwdlen);
 
 	if (thread == NULL || map == NULL)
 		goto out_problem;
@@ -444,22 +539,52 @@ int event__process_task(event_t *self, s
 
 void thread__find_addr_map(struct thread *self,
 			   struct perf_session *session, u8 cpumode,
-			   enum map_type type, u64 addr,
+			   enum map_type type, pid_t pid, u64 addr,
 			   struct addr_location *al)
 {
 	struct map_groups *mg = &self->mg;
+	struct kernel_info *kerninfo = NULL;
 
 	al->thread = self;
 	al->addr = addr;
+	al->cpumode = cpumode;
+	al->filtered = false;
 
-	if (cpumode == PERF_RECORD_MISC_KERNEL) {
+	if (cpumode == PERF_RECORD_MISC_KERNEL && perf_host) {
 		al->level = 'k';
-		mg = &session->kmaps;
-	} else if (cpumode == PERF_RECORD_MISC_USER)
+		kerninfo = kerninfo__findhost(&session->kerninfo_root);
+		mg = &kerninfo->kmaps;
+	} else if (cpumode == PERF_RECORD_MISC_USER && perf_host) {
 		al->level = '.';
-	else {
-		al->level = 'H';
+		kerninfo = kerninfo__findhost(&session->kerninfo_root);
+	} else if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL && perf_guest) {
+		al->level = 'g';
+		kerninfo = kerninfo__find(&session->kerninfo_root, pid);
+		if (!kerninfo) {
+			al->map = NULL;
+			return;
+		}
+		mg = &kerninfo->kmaps;
+	} else {
+		/*
+		 * 'u' means guest os user space.
+		 * TODO: We don't support guest user space yet; we might add it later.
+		 */
+		if (cpumode == PERF_RECORD_MISC_GUEST_USER && perf_guest)
+			al->level = 'u';
+		else
+			al->level = 'H';
 		al->map = NULL;
+
+		if ((cpumode == PERF_RECORD_MISC_GUEST_USER ||
+			cpumode == PERF_RECORD_MISC_GUEST_KERNEL) &&
+			!perf_guest)
+			al->filtered = true;
+		if ((cpumode == PERF_RECORD_MISC_USER ||
+			cpumode == PERF_RECORD_MISC_KERNEL) &&
+			!perf_host)
+			al->filtered = true;
+
 		return;
 	}
 try_again:
@@ -474,8 +599,11 @@ try_again:
 		 * "[vdso]" dso, but for now lets use the old trick of looking
 		 * in the whole kernel symbol list.
 		 */
-		if ((long long)al->addr < 0 && mg != &session->kmaps) {
-			mg = &session->kmaps;
+		if ((long long)al->addr < 0 &&
+			cpumode == PERF_RECORD_MISC_KERNEL &&
+			kerninfo &&
+			mg != &kerninfo->kmaps)  {
+			mg = &kerninfo->kmaps;
 			goto try_again;
 		}
 	} else
@@ -484,11 +612,11 @@ try_again:
 
 void thread__find_addr_location(struct thread *self,
 				struct perf_session *session, u8 cpumode,
-				enum map_type type, u64 addr,
+				enum map_type type, pid_t pid, u64 addr,
 				struct addr_location *al,
 				symbol_filter_t filter)
 {
-	thread__find_addr_map(self, session, cpumode, type, addr, al);
+	thread__find_addr_map(self, session, cpumode, type, pid, addr, al);
 	if (al->map != NULL)
 		al->sym = map__find_symbol(al->map, al->addr, filter);
 	else
@@ -524,7 +652,7 @@ int event__preprocess_sample(const event
 	dump_printf(" ... thread: %s:%d\n", thread->comm, thread->pid);
 
 	thread__find_addr_map(thread, session, cpumode, MAP__FUNCTION,
-			      self->ip.ip, al);
+			      self->ip.pid, self->ip.ip, al);
 	dump_printf(" ...... dso: %s\n",
 		    al->map ? al->map->dso->long_name :
 			al->level == 'H' ? "[hypervisor]" : "<not found>");
@@ -554,7 +682,6 @@ int event__preprocess_sample(const event
 	    !strlist__has_entry(symbol_conf.sym_list, al->sym->name))
 		goto out_filtered;
 
-	al->filtered = false;
 	return 0;
 
 out_filtered:
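The cpumode dispatch in `thread__find_addr_map()` above both picks the annotation level character (`k`, `.`, `g`, `u`, `H`) and marks samples as filtered when host or guest collection is disabled. A minimal sketch of that policy, assuming the `PERF_RECORD_MISC_*` values from the perf_event ABI; `resolve_level()` and `sample_filtered()` are hypothetical helpers, not functions from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* cpumode values as defined by the perf_event ABI. */
#define PERF_RECORD_MISC_KERNEL		1
#define PERF_RECORD_MISC_USER		2
#define PERF_RECORD_MISC_HYPERVISOR	3
#define PERF_RECORD_MISC_GUEST_KERNEL	4
#define PERF_RECORD_MISC_GUEST_USER	5

/* Illustrative mirror of the level selection in thread__find_addr_map(). */
static char resolve_level(unsigned char cpumode, bool host, bool guest)
{
	switch (cpumode) {
	case PERF_RECORD_MISC_KERNEL:
		return host ? 'k' : 'H';
	case PERF_RECORD_MISC_USER:
		return host ? '.' : 'H';
	case PERF_RECORD_MISC_GUEST_KERNEL:
		return guest ? 'g' : 'H';
	case PERF_RECORD_MISC_GUEST_USER:
		return guest ? 'u' : 'H';
	default:
		return 'H';	/* hypervisor or unknown */
	}
}

/* A sample is dropped when its origin side is not being collected. */
static bool sample_filtered(unsigned char cpumode, bool host, bool guest)
{
	if ((cpumode == PERF_RECORD_MISC_GUEST_KERNEL ||
	     cpumode == PERF_RECORD_MISC_GUEST_USER) && !guest)
		return true;
	if ((cpumode == PERF_RECORD_MISC_KERNEL ||
	     cpumode == PERF_RECORD_MISC_USER) && !host)
		return true;
	return false;
}
```

In the patch itself the same decision additionally selects which `kernel_info` and `map_groups` to resolve the address against.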
diff -Nraup linux-2.6_tip0413/tools/perf/util/event.h linux-2.6_tip0413_perfkvm/tools/perf/util/event.h
--- linux-2.6_tip0413/tools/perf/util/event.h	2010-04-14 11:11:58.638239002 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/event.h	2010-04-14 14:12:02.533688079 +0800
@@ -79,6 +79,7 @@ struct sample_data {
 
 struct build_id_event {
 	struct perf_event_header header;
+	pid_t			 pid;
 	u8			 build_id[ALIGN(BUILD_ID_SIZE, sizeof(u64))];
 	char			 filename[];
 };
@@ -119,10 +120,13 @@ int event__synthesize_thread(pid_t pid, 
 void event__synthesize_threads(event__handler_t process,
 			       struct perf_session *session);
 int event__synthesize_kernel_mmap(event__handler_t process,
-				  struct perf_session *session,
-				  const char *symbol_name);
+				struct perf_session *session,
+				struct kernel_info *kerninfo,
+				const char *symbol_name);
+
 int event__synthesize_modules(event__handler_t process,
-			      struct perf_session *session);
+			      struct perf_session *session,
+			      struct kernel_info *kerninfo);
 
 int event__process_comm(event_t *self, struct perf_session *session);
 int event__process_lost(event_t *self, struct perf_session *session);
diff -Nraup linux-2.6_tip0413/tools/perf/util/header.c linux-2.6_tip0413_perfkvm/tools/perf/util/header.c
--- linux-2.6_tip0413/tools/perf/util/header.c	2010-04-14 11:11:58.594236160 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/header.c	2010-04-14 11:13:17.317861518 +0800
@@ -197,7 +197,8 @@ static int write_padded(int fd, const vo
 			continue;		\
 		else
 
-static int __dsos__write_buildid_table(struct list_head *head, u16 misc, int fd)
+static int __dsos__write_buildid_table(struct list_head *head, pid_t pid,
+				u16 misc, int fd)
 {
 	struct dso *pos;
 
@@ -212,6 +213,7 @@ static int __dsos__write_buildid_table(s
 		len = ALIGN(len, NAME_ALIGN);
 		memset(&b, 0, sizeof(b));
 		memcpy(&b.build_id, pos->build_id, sizeof(pos->build_id));
+		b.pid = pid;
 		b.header.misc = misc;
 		b.header.size = sizeof(b) + len;
 		err = do_write(fd, &b, sizeof(b));
@@ -226,13 +228,33 @@ static int __dsos__write_buildid_table(s
 	return 0;
 }
 
-static int dsos__write_buildid_table(int fd)
+static int dsos__write_buildid_table(struct perf_header *header, int fd)
 {
-	int err = __dsos__write_buildid_table(&dsos__kernel,
-					      PERF_RECORD_MISC_KERNEL, fd);
-	if (err == 0)
-		err = __dsos__write_buildid_table(&dsos__user,
-						  PERF_RECORD_MISC_USER, fd);
+	struct perf_session *session = container_of(header,
+			struct perf_session, header);
+	struct rb_node *nd;
+	int err = 0;
+	u16 kmisc, umisc;
+
+	for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+				rb_node);
+		if (is_host_kernel(pos)) {
+			kmisc = PERF_RECORD_MISC_KERNEL;
+			umisc = PERF_RECORD_MISC_USER;
+		} else {
+			kmisc = PERF_RECORD_MISC_GUEST_KERNEL;
+			umisc = PERF_RECORD_MISC_GUEST_USER;
+		}
+
+		err = __dsos__write_buildid_table(&pos->dsos__kernel, pos->pid,
+				kmisc, fd);
+		if (err == 0)
+			err = __dsos__write_buildid_table(&pos->dsos__user,
+				pos->pid, umisc, fd);
+		if (err)
+			break;
+	}
 	return err;
 }
 
@@ -349,9 +371,12 @@ static int __dsos__cache_build_ids(struc
 	return err;
 }
 
-static int dsos__cache_build_ids(void)
+static int dsos__cache_build_ids(struct perf_header *self)
 {
-	int err_kernel, err_user;
+	struct perf_session *session = container_of(self,
+			struct perf_session, header);
+	struct rb_node *nd;
+	int ret = 0;
 	char debugdir[PATH_MAX];
 
 	snprintf(debugdir, sizeof(debugdir), "%s/%s", getenv("HOME"),
@@ -360,9 +385,30 @@ static int dsos__cache_build_ids(void)
 	if (mkdir(debugdir, 0755) != 0 && errno != EEXIST)
 		return -1;
 
-	err_kernel = __dsos__cache_build_ids(&dsos__kernel, debugdir);
-	err_user   = __dsos__cache_build_ids(&dsos__user, debugdir);
-	return err_kernel || err_user ? -1 : 0;
+	for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+				rb_node);
+		ret |= __dsos__cache_build_ids(&pos->dsos__kernel, debugdir);
+		ret |= __dsos__cache_build_ids(&pos->dsos__user, debugdir);
+	}
+	return ret ? -1 : 0;
+}
+
+static bool dsos__read_build_ids(struct perf_header *self, bool with_hits)
+{
+	bool ret = false;
+	struct perf_session *session = container_of(self,
+			struct perf_session, header);
+	struct rb_node *nd;
+
+	for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+				rb_node);
+		ret |= __dsos__read_build_ids(&pos->dsos__kernel, with_hits);
+		ret |= __dsos__read_build_ids(&pos->dsos__user, with_hits);
+	}
+
+	return ret;
 }
 
 static int perf_header__adds_write(struct perf_header *self, int fd)
@@ -373,7 +419,7 @@ static int perf_header__adds_write(struc
 	u64 sec_start;
 	int idx = 0, err;
 
-	if (dsos__read_build_ids(true))
+	if (dsos__read_build_ids(self, true))
 		perf_header__set_feat(self, HEADER_BUILD_ID);
 
 	nr_sections = bitmap_weight(self->adds_features, HEADER_FEAT_BITS);
@@ -408,14 +454,14 @@ static int perf_header__adds_write(struc
 
 		/* Write build-ids */
 		buildid_sec->offset = lseek(fd, 0, SEEK_CUR);
-		err = dsos__write_buildid_table(fd);
+		err = dsos__write_buildid_table(self, fd);
 		if (err < 0) {
 			pr_debug("failed to write buildid table\n");
 			goto out_free;
 		}
 		buildid_sec->size = lseek(fd, 0, SEEK_CUR) -
 					  buildid_sec->offset;
-		dsos__cache_build_ids();
+		dsos__cache_build_ids(self);
 	}
 
 	lseek(fd, sec_start, SEEK_SET);
@@ -636,6 +682,72 @@ int perf_file_header__read(struct perf_f
 	return 0;
 }
 
+static int perf_header__read_build_ids(struct perf_header *self,
+			int input, u64 offset, u64 size)
+{
+	struct perf_session *session = container_of(self,
+			struct perf_session, header);
+	struct build_id_event bev;
+	char filename[PATH_MAX];
+	u64 limit = offset + size;
+	int err = -1;
+	struct list_head *head;
+	struct kernel_info *kerninfo;
+	u16 misc;
+
+	while (offset < limit) {
+		struct dso *dso;
+		ssize_t len;
+		enum dso_kernel_type dso_type;
+
+		if (read(input, &bev, sizeof(bev)) != sizeof(bev))
+			goto out;
+
+		kerninfo = kerninfo__findnew(&session->kerninfo_root, bev.pid);
+		if (!kerninfo)
+			goto out;
+
+		if (self->needs_swap)
+			perf_event_header__bswap(&bev.header);
+
+		len = bev.header.size - sizeof(bev);
+		if (read(input, filename, len) != len)
+			goto out;
+
+		misc = bev.header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+
+		switch (misc) {
+		case PERF_RECORD_MISC_KERNEL:
+			dso_type = DSO_TYPE_KERNEL;
+			head = &kerninfo->dsos__kernel;
+			break;
+		case PERF_RECORD_MISC_GUEST_KERNEL:
+			dso_type = DSO_TYPE_GUEST_KERNEL;
+			head = &kerninfo->dsos__kernel;
+			break;
+		case PERF_RECORD_MISC_USER:
+		case PERF_RECORD_MISC_GUEST_USER:
+			dso_type = DSO_TYPE_USER;
+			head = &kerninfo->dsos__user;
+			break;
+		default:
+			goto out;
+		}
+
+		dso = __dsos__findnew(head, filename);
+		if (dso != NULL) {
+			dso__set_build_id(dso, &bev.build_id);
+			if (filename[0] == '[')
+				dso->kernel = dso_type;
+		}
+
+		offset += bev.header.size;
+	}
+	err = 0;
+out:
+	return err;
+}
+
 static int perf_file_section__process(struct perf_file_section *self,
 				      struct perf_header *ph,
 				      int feat, int fd)
diff -Nraup linux-2.6_tip0413/tools/perf/util/hist.c linux-2.6_tip0413_perfkvm/tools/perf/util/hist.c
--- linux-2.6_tip0413/tools/perf/util/hist.c	2010-04-14 11:11:58.766255670 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/hist.c	2010-04-14 16:02:22.299845756 +0800
@@ -8,6 +8,30 @@ struct callchain_param	callchain_param =
 	.min_percent = 0.5
 };
 
+void __perf_session__add_count(struct hist_entry *he,
+			struct addr_location *al,
+			u64 count)
+{
+	he->count += count;
+
+	switch (al->cpumode) {
+	case PERF_RECORD_MISC_KERNEL:
+		he->count_sys += count;
+		break;
+	case PERF_RECORD_MISC_USER:
+		he->count_us += count;
+		break;
+	case PERF_RECORD_MISC_GUEST_KERNEL:
+		he->count_guest_sys += count;
+		break;
+	case PERF_RECORD_MISC_GUEST_USER:
+		he->count_guest_us += count;
+		break;
+	default:
+		break;
+	}
+}
+
 /*
  * histogram, sorted on item, collects counts
  */
@@ -464,7 +488,7 @@ int hist_entry__snprintf(struct hist_ent
 			   u64 session_total)
 {
 	struct sort_entry *se;
-	u64 count, total;
+	u64 count, total, count_sys, count_us, count_guest_sys, count_guest_us;
 	const char *sep = symbol_conf.field_sep;
 	int ret;
 
@@ -474,9 +498,17 @@ int hist_entry__snprintf(struct hist_ent
 	if (pair_session) {
 		count = self->pair ? self->pair->count : 0;
 		total = pair_session->events_stats.total;
+		count_sys = self->pair ? self->pair->count_sys : 0;
+		count_us = self->pair ? self->pair->count_us : 0;
+		count_guest_sys = self->pair ? self->pair->count_guest_sys : 0;
+		count_guest_us = self->pair ? self->pair->count_guest_us : 0;
 	} else {
 		count = self->count;
 		total = session_total;
+		count_sys = self->count_sys;
+		count_us = self->count_us;
+		count_guest_sys = self->count_guest_sys;
+		count_guest_us = self->count_guest_us;
 	}
 
 	if (total) {
@@ -487,6 +519,22 @@ int hist_entry__snprintf(struct hist_ent
 		else
 			ret = snprintf(s, size, sep ? "%.2f" : "   %6.2f%%",
 				       (count * 100.0) / total);
+		if (symbol_conf.show_cpu_utilization) {
+			ret += percent_color_snprintf(s + ret, size - ret,
+					sep ? "%.2f" : "   %6.2f%%",
+					(count_sys * 100.0) / total);
+			ret += percent_color_snprintf(s + ret, size - ret,
+					sep ? "%.2f" : "   %6.2f%%",
+					(count_us * 100.0) / total);
+			if (perf_guest) {
+				ret += percent_color_snprintf(s + ret, size - ret,
+						sep ? "%.2f" : "   %6.2f%%",
+						(count_guest_sys * 100.0) / total);
+				ret += percent_color_snprintf(s + ret, size - ret,
+						sep ? "%.2f" : "   %6.2f%%",
+						(count_guest_us * 100.0) / total);
+			}
+		}
 	} else
 		ret = snprintf(s, size, sep ? "%lld" : "%12lld ", count);
 
@@ -597,6 +645,24 @@ size_t perf_session__fprintf_hists(struc
 			fputs("  Samples  ", fp);
 	}
 
+	if (symbol_conf.show_cpu_utilization) {
+		if (sep) {
+			ret += fprintf(fp, "%csys", *sep);
+			ret += fprintf(fp, "%cus", *sep);
+			if (perf_guest) {
+				ret += fprintf(fp, "%cguest sys", *sep);
+				ret += fprintf(fp, "%cguest us", *sep);
+			}
+		} else {
+			ret += fprintf(fp, "  sys  ");
+			ret += fprintf(fp, "  us  ");
+			if (perf_guest) {
+				ret += fprintf(fp, "  guest sys  ");
+				ret += fprintf(fp, "  guest us  ");
+			}
+		}
+	}
+
 	if (pair) {
 		if (sep)
 			ret += fprintf(fp, "%cDelta", *sep);
diff -Nraup linux-2.6_tip0413/tools/perf/util/hist.h linux-2.6_tip0413_perfkvm/tools/perf/util/hist.h
--- linux-2.6_tip0413/tools/perf/util/hist.h	2010-04-14 11:11:58.674215806 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/hist.h	2010-04-14 11:13:17.317861518 +0800
@@ -12,6 +12,9 @@ struct addr_location;
 struct symbol;
 struct rb_root;
 
+void __perf_session__add_count(struct hist_entry *he,
+			struct addr_location *al,
+			u64 count);
 struct hist_entry *__perf_session__add_hist_entry(struct rb_root *hists,
 						  struct addr_location *al,
 						  struct symbol *parent,
diff -Nraup linux-2.6_tip0413/tools/perf/util/map.c linux-2.6_tip0413_perfkvm/tools/perf/util/map.c
--- linux-2.6_tip0413/tools/perf/util/map.c	2010-04-14 11:11:58.642241284 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/map.c	2010-04-14 16:08:55.377366557 +0800
@@ -4,6 +4,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
+#include <unistd.h>
 #include "map.h"
 
 const char *map_type__name[MAP__NR_TYPES] = {
@@ -37,9 +38,11 @@ void map__init(struct map *self, enum ma
 	self->map_ip   = map__map_ip;
 	self->unmap_ip = map__unmap_ip;
 	RB_CLEAR_NODE(&self->rb_node);
+	self->groups   = NULL;
 }
 
-struct map *map__new(u64 start, u64 len, u64 pgoff, u32 pid, char *filename,
+struct map *map__new(struct list_head *dsos__list, u64 start, u64 len,
+		     u64 pgoff, u32 pid, char *filename,
 		     enum map_type type, char *cwd, int cwdlen)
 {
 	struct map *self = malloc(sizeof(*self));
@@ -66,7 +69,7 @@ struct map *map__new(u64 start, u64 len,
 			filename = newfilename;
 		}
 
-		dso = dsos__findnew(filename);
+		dso = __dsos__findnew(dsos__list, filename);
 		if (dso == NULL)
 			goto out_delete;
 
@@ -242,6 +245,7 @@ void map_groups__init(struct map_groups 
 		self->maps[i] = RB_ROOT;
 		INIT_LIST_HEAD(&self->removed_maps[i]);
 	}
+	self->this_kerninfo = NULL;
 }
 
 void map_groups__flush(struct map_groups *self)
@@ -508,3 +512,123 @@ struct map *maps__find(struct rb_root *m
 
 	return NULL;
 }
+
+struct kernel_info *add_new_kernel_info(struct rb_root *kerninfo_root,
+			pid_t pid, const char *root_dir)
+{
+	struct rb_node **p = &kerninfo_root->rb_node;
+	struct rb_node *parent = NULL;
+	struct kernel_info *kerninfo, *pos;
+
+	kerninfo = malloc(sizeof(struct kernel_info));
+	if (!kerninfo)
+		return NULL;
+
+	kerninfo->pid = pid;
+	map_groups__init(&kerninfo->kmaps);
+	kerninfo->root_dir = strdup(root_dir);
+	RB_CLEAR_NODE(&kerninfo->rb_node);
+	INIT_LIST_HEAD(&kerninfo->dsos__user);
+	INIT_LIST_HEAD(&kerninfo->dsos__kernel);
+	kerninfo->kmaps.this_kerninfo = kerninfo;
+
+	while (*p != NULL) {
+		parent = *p;
+		pos = rb_entry(parent, struct kernel_info, rb_node);
+		if (pid < pos->pid)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&kerninfo->rb_node, parent, p);
+	rb_insert_color(&kerninfo->rb_node, kerninfo_root);
+
+	return kerninfo;
+}
+
+struct kernel_info *kerninfo__find(struct rb_root *kerninfo_root, pid_t pid)
+{
+	struct rb_node **p = &kerninfo_root->rb_node;
+	struct rb_node *parent = NULL;
+	struct kernel_info *kerninfo;
+	struct kernel_info *default_kerninfo = NULL;
+
+	while (*p != NULL) {
+		parent = *p;
+		kerninfo = rb_entry(parent, struct kernel_info, rb_node);
+		if (pid < kerninfo->pid)
+			p = &(*p)->rb_left;
+		else if (pid > kerninfo->pid)
+			p = &(*p)->rb_right;
+		else
+			return kerninfo;
+		if (!kerninfo->pid)
+			default_kerninfo = kerninfo;
+	}
+
+	return default_kerninfo;
+}
+
+struct kernel_info *kerninfo__findhost(struct rb_root *kerninfo_root)
+{
+	struct rb_node **p = &kerninfo_root->rb_node;
+	struct rb_node *parent = NULL;
+	struct kernel_info *kerninfo;
+	pid_t pid = HOST_KERNEL_ID;
+
+	while (*p != NULL) {
+		parent = *p;
+		kerninfo = rb_entry(parent, struct kernel_info, rb_node);
+		if (pid < kerninfo->pid)
+			p = &(*p)->rb_left;
+		else if (pid > kerninfo->pid)
+			p = &(*p)->rb_right;
+		else
+			return kerninfo;
+	}
+
+	return NULL;
+}
+
+struct kernel_info *kerninfo__findnew(struct rb_root *kerninfo_root, pid_t pid)
+{
+	char path[PATH_MAX];
+	const char *root_dir;
+	int ret;
+	struct kernel_info *kerninfo = kerninfo__find(kerninfo_root, pid);
+
+	if (!kerninfo || kerninfo->pid != pid) {
+		if (pid == HOST_KERNEL_ID || pid == DEFAULT_GUEST_KERNEL_ID)
+			root_dir = "";
+		else {
+			if (!symbol_conf.guestmount)
+				goto out;
+			snprintf(path, sizeof(path), "%s/%d", symbol_conf.guestmount, pid);
+			ret = access(path, R_OK);
+			if (ret) {
+				pr_err("Can't access file %s\n", path);
+				goto out;
+			}
+			root_dir = path;
+		}
+		kerninfo = add_new_kernel_info(kerninfo_root, pid, root_dir);
+	}
+
+out:
+	return kerninfo;
+}
+
+void kerninfo__process_allkernels(struct rb_root *kerninfo_root,
+		process_kernel_info process,
+		void *data)
+{
+	struct rb_node *nd;
+
+	for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+							rb_node);
+		process(pos, data);
+	}
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/util/map.h linux-2.6_tip0413_perfkvm/tools/perf/util/map.h
--- linux-2.6_tip0413/tools/perf/util/map.h	2010-04-14 11:11:58.686216105 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/map.h	2010-04-14 16:12:24.683245583 +0800
@@ -19,6 +19,7 @@ extern const char *map_type__name[MAP__N
 struct dso;
 struct ref_reloc_sym;
 struct map_groups;
+struct kernel_info;
 
 struct map {
 	union {
@@ -36,6 +37,7 @@ struct map {
 	u64			(*unmap_ip)(struct map *, u64);
 
 	struct dso		*dso;
+	struct map_groups	*groups;
 };
 
 struct kmap {
@@ -43,6 +45,26 @@ struct kmap {
 	struct map_groups	*kmaps;
 };
 
+struct map_groups {
+	struct rb_root		maps[MAP__NR_TYPES];
+	struct list_head	removed_maps[MAP__NR_TYPES];
+	struct kernel_info	*this_kerninfo;
+};
+
+/* Native host kernel uses -1 as pid index in kernel_info */
+#define	HOST_KERNEL_ID			(-1)
+#define	DEFAULT_GUEST_KERNEL_ID		(0)
+
+struct kernel_info {
+	struct rb_node rb_node;
+	pid_t pid;
+	char *root_dir;
+	struct list_head dsos__user;
+	struct list_head dsos__kernel;
+	struct map_groups kmaps;
+	struct map *vmlinux_maps[MAP__NR_TYPES];
+};
+
 static inline struct kmap *map__kmap(struct map *self)
 {
 	return (struct kmap *)(self + 1);
@@ -74,7 +96,8 @@ typedef int (*symbol_filter_t)(struct ma
 
 void map__init(struct map *self, enum map_type type,
 	       u64 start, u64 end, u64 pgoff, struct dso *dso);
-struct map *map__new(u64 start, u64 len, u64 pgoff, u32 pid, char *filename,
+struct map *map__new(struct list_head *dsos__list, u64 start, u64 len,
+		     u64 pgoff, u32 pid, char *filename,
 		     enum map_type type, char *cwd, int cwdlen);
 void map__delete(struct map *self);
 struct map *map__clone(struct map *self);
@@ -91,11 +114,6 @@ void map__fixup_end(struct map *self);
 
 void map__reloc_vmlinux(struct map *self);
 
-struct map_groups {
-	struct rb_root		maps[MAP__NR_TYPES];
-	struct list_head	removed_maps[MAP__NR_TYPES];
-};
-
 size_t __map_groups__fprintf_maps(struct map_groups *self,
 				  enum map_type type, int verbose, FILE *fp);
 void maps__insert(struct rb_root *maps, struct map *map);
@@ -106,9 +124,39 @@ int map_groups__clone(struct map_groups 
 size_t map_groups__fprintf(struct map_groups *self, int verbose, FILE *fp);
 size_t map_groups__fprintf_maps(struct map_groups *self, int verbose, FILE *fp);
 
+struct kernel_info *add_new_kernel_info(struct rb_root *kerninfo_root,
+			pid_t pid, const char *root_dir);
+struct kernel_info *kerninfo__find(struct rb_root *kerninfo_root, pid_t pid);
+struct kernel_info *kerninfo__findnew(struct rb_root *kerninfo_root, pid_t pid);
+struct kernel_info *kerninfo__findhost(struct rb_root *kerninfo_root);
+
+/*
+ * Default guest kernel is defined by parameter --guestkallsyms
+ * and --guestmodules
+ */
+static inline int is_default_guest(struct kernel_info *kerninfo)
+{
+	if (!kerninfo)
+		return 0;
+	return kerninfo->pid == DEFAULT_GUEST_KERNEL_ID;
+}
+
+static inline int is_host_kernel(struct kernel_info *kerninfo)
+{
+	if (!kerninfo)
+		return 0;
+	return kerninfo->pid == HOST_KERNEL_ID;
+}
+
+typedef void (*process_kernel_info)(struct kernel_info *kerninfo, void *data);
+void kerninfo__process_allkernels(struct rb_root *kerninfo_root,
+		process_kernel_info process,
+		void *data);
+
 static inline void map_groups__insert(struct map_groups *self, struct map *map)
 {
-	 maps__insert(&self->maps[map->type], map);
+	maps__insert(&self->maps[map->type], map);
+	map->groups = self;
 }
 
 static inline struct map *map_groups__find(struct map_groups *self,
@@ -148,13 +196,11 @@ int map_groups__fixup_overlappings(struc
 
 struct map *map_groups__find_by_name(struct map_groups *self,
 				     enum map_type type, const char *name);
-int __map_groups__create_kernel_maps(struct map_groups *self,
-				     struct map *vmlinux_maps[MAP__NR_TYPES],
-				     struct dso *kernel);
-int map_groups__create_kernel_maps(struct map_groups *self,
-				   struct map *vmlinux_maps[MAP__NR_TYPES]);
-struct map *map_groups__new_module(struct map_groups *self, u64 start,
-				   const char *filename);
+struct map *map_groups__new_module(struct map_groups *self,
+				    u64 start,
+				    const char *filename,
+				    struct kernel_info *kerninfo);
+
 void map_groups__flush(struct map_groups *self);
 
 #endif /* __PERF_MAP_H */
diff -Nraup linux-2.6_tip0413/tools/perf/util/probe-event.c linux-2.6_tip0413_perfkvm/tools/perf/util/probe-event.c
--- linux-2.6_tip0413/tools/perf/util/probe-event.c	2010-04-14 11:11:58.614279111 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/probe-event.c	2010-04-14 11:13:17.321860837 +0800
@@ -78,6 +78,8 @@ static struct map *kmaps[MAP__NR_TYPES];
 /* Initialize symbol maps and path of vmlinux */
 static void init_vmlinux(void)
 {
+	struct dso *kernel;
+
 	symbol_conf.sort_by_name = true;
 	if (symbol_conf.vmlinux_name == NULL)
 		symbol_conf.try_vmlinux_path = true;
@@ -86,8 +88,12 @@ static void init_vmlinux(void)
 	if (symbol__init() < 0)
 		die("Failed to init symbol map.");
 
+	kernel = dso__new_kernel(symbol_conf.vmlinux_name);
+	if (kernel == NULL)
+		die("Failed to create kernel dso.");
+
 	map_groups__init(&kmap_groups);
-	if (map_groups__create_kernel_maps(&kmap_groups, kmaps) < 0)
+	if (__map_groups__create_kernel_maps(&kmap_groups, kmaps, kernel) < 0)
 		die("Failed to create kernel maps.");
 }
 
diff -Nraup linux-2.6_tip0413/tools/perf/util/session.c linux-2.6_tip0413_perfkvm/tools/perf/util/session.c
--- linux-2.6_tip0413/tools/perf/util/session.c	2010-04-14 11:11:58.794254600 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/session.c	2010-04-14 16:15:56.564948860 +0800
@@ -52,6 +52,17 @@ out_close:
 	return -1;
 }
 
+int perf_session__create_kernel_maps(struct perf_session *self)
+{
+	int ret;
+	struct rb_root *root = &self->kerninfo_root;
+
+	ret = map_groups__create_kernel_maps(root, HOST_KERNEL_ID);
+	if (ret >= 0)
+		ret = map_groups__create_guest_kernel_maps(root);
+	return ret;
+}
+
 struct perf_session *perf_session__new(const char *filename, int mode, bool force)
 {
 	size_t len = filename ? strlen(filename) + 1 : 0;
@@ -71,7 +82,7 @@ struct perf_session *perf_session__new(c
 	self->cwd = NULL;
 	self->cwdlen = 0;
 	self->unknown_events = 0;
-	map_groups__init(&self->kmaps);
+	self->kerninfo_root = RB_ROOT;
 
 	if (mode == O_RDONLY) {
 		if (perf_session__open(self, force) < 0)
@@ -142,8 +153,9 @@ struct map_symbol *perf_session__resolve
 			continue;
 		}
 
+		al.filtered = false;
 		thread__find_addr_location(thread, self, cpumode,
-					   MAP__FUNCTION, ip, &al, NULL);
+				MAP__FUNCTION, thread->pid, ip, &al, NULL);
 		if (al.sym != NULL) {
 			if (sort__has_parent && !*parent &&
 			    symbol__match_parent_regex(al.sym))
@@ -324,46 +336,6 @@ void perf_event_header__bswap(struct per
 	self->size = bswap_16(self->size);
 }
 
-int perf_header__read_build_ids(struct perf_header *self,
-				int input, u64 offset, u64 size)
-{
-	struct build_id_event bev;
-	char filename[PATH_MAX];
-	u64 limit = offset + size;
-	int err = -1;
-
-	while (offset < limit) {
-		struct dso *dso;
-		ssize_t len;
-		struct list_head *head = &dsos__user;
-
-		if (read(input, &bev, sizeof(bev)) != sizeof(bev))
-			goto out;
-
-		if (self->needs_swap)
-			perf_event_header__bswap(&bev.header);
-
-		len = bev.header.size - sizeof(bev);
-		if (read(input, filename, len) != len)
-			goto out;
-
-		if (bev.header.misc & PERF_RECORD_MISC_KERNEL)
-			head = &dsos__kernel;
-
-		dso = __dsos__findnew(head, filename);
-		if (dso != NULL) {
-			dso__set_build_id(dso, &bev.build_id);
-			if (head == &dsos__kernel && filename[0] == '[')
-				dso->kernel = 1;
-		}
-
-		offset += bev.header.size;
-	}
-	err = 0;
-out:
-	return err;
-}
-
 static struct thread *perf_session__register_idle_thread(struct perf_session *self)
 {
 	struct thread *thread = perf_session__findnew(self, 0);
@@ -516,26 +488,33 @@ bool perf_session__has_traces(struct per
 	return true;
 }
 
-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map **maps,
 					     const char *symbol_name,
 					     u64 addr)
 {
 	char *bracket;
 	enum map_type i;
+	struct ref_reloc_sym *ref;
+
+	ref = zalloc(sizeof(struct ref_reloc_sym));
+	if (ref == NULL)
+		return -ENOMEM;
 
-	self->ref_reloc_sym.name = strdup(symbol_name);
-	if (self->ref_reloc_sym.name == NULL)
+	ref->name = strdup(symbol_name);
+	if (ref->name == NULL) {
+		free(ref);
 		return -ENOMEM;
+	}
 
-	bracket = strchr(self->ref_reloc_sym.name, ']');
+	bracket = strchr(ref->name, ']');
 	if (bracket)
 		*bracket = '\0';
 
-	self->ref_reloc_sym.addr = addr;
+	ref->addr = addr;
 
 	for (i = 0; i < MAP__NR_TYPES; ++i) {
-		struct kmap *kmap = map__kmap(self->vmlinux_maps[i]);
-		kmap->ref_reloc_sym = &self->ref_reloc_sym;
+		struct kmap *kmap = map__kmap(maps[i]);
+		kmap->ref_reloc_sym = ref;
 	}
 
 	return 0;
diff -Nraup linux-2.6_tip0413/tools/perf/util/session.h linux-2.6_tip0413_perfkvm/tools/perf/util/session.h
--- linux-2.6_tip0413/tools/perf/util/session.h	2010-04-14 11:11:58.606252925 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/session.h	2010-04-14 11:13:17.321860837 +0800
@@ -15,17 +15,15 @@ struct perf_session {
 	struct perf_header	header;
 	unsigned long		size;
 	unsigned long		mmap_window;
-	struct map_groups	kmaps;
 	struct rb_root		threads;
 	struct thread		*last_match;
-	struct map		*vmlinux_maps[MAP__NR_TYPES];
+	struct rb_root		kerninfo_root;
 	struct events_stats	events_stats;
 	struct rb_root		stats_by_id;
 	unsigned long		event_total[PERF_RECORD_MAX];
 	unsigned long		unknown_events;
 	struct rb_root		hists;
 	u64			sample_type;
-	struct ref_reloc_sym	ref_reloc_sym;
 	int			fd;
 	int			cwdlen;
 	char			*cwd;
@@ -64,33 +62,13 @@ struct map_symbol *perf_session__resolve
 
 bool perf_session__has_traces(struct perf_session *self, const char *msg);
 
-int perf_header__read_build_ids(struct perf_header *self, int input,
-				u64 offset, u64 file_size);
-
-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map **maps,
 					     const char *symbol_name,
 					     u64 addr);
 
 void mem_bswap_64(void *src, int byte_size);
 
-static inline int __perf_session__create_kernel_maps(struct perf_session *self,
-						struct dso *kernel)
-{
-	return __map_groups__create_kernel_maps(&self->kmaps,
-						self->vmlinux_maps, kernel);
-}
-
-static inline int perf_session__create_kernel_maps(struct perf_session *self)
-{
-	return map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
-}
-
-static inline struct map *
-	perf_session__new_module_map(struct perf_session *self,
-				     u64 start, const char *filename)
-{
-	return map_groups__new_module(&self->kmaps, start, filename);
-}
+int perf_session__create_kernel_maps(struct perf_session *self);
 
 #ifdef NO_NEWT_SUPPORT
 static inline int perf_session__browse_hists(struct rb_root *hists __used,
diff -Nraup linux-2.6_tip0413/tools/perf/util/sort.h linux-2.6_tip0413_perfkvm/tools/perf/util/sort.h
--- linux-2.6_tip0413/tools/perf/util/sort.h	2010-04-14 11:11:58.610258472 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/sort.h	2010-04-14 11:13:17.321860837 +0800
@@ -44,6 +44,11 @@ extern enum sort_type sort__first_dimens
 struct hist_entry {
 	struct rb_node		rb_node;
 	u64			count;
+	u64			count_sys;
+	u64			count_us;
+	u64			count_guest_sys;
+	u64			count_guest_us;
+
 	/*
 	 * XXX WARNING!
 	 * thread _has_ to come after ms, see
diff -Nraup linux-2.6_tip0413/tools/perf/util/symbol.c linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.c
--- linux-2.6_tip0413/tools/perf/util/symbol.c	2010-04-14 11:11:58.614279111 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.c	2010-04-14 16:51:51.803796961 +0800
@@ -28,6 +28,8 @@ static void dsos__add(struct list_head *
 static struct map *map__new2(u64 start, struct dso *dso, enum map_type type);
 static int dso__load_kernel_sym(struct dso *self, struct map *map,
 				symbol_filter_t filter);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+			symbol_filter_t filter);
 static int vmlinux_path__nr_entries;
 static char **vmlinux_path;
 
@@ -186,6 +188,7 @@ struct dso *dso__new(const char *name)
 		self->loaded = 0;
 		self->sorted_by_name = 0;
 		self->has_build_id = 0;
+		self->kernel = DSO_TYPE_USER;
 	}
 
 	return self;
@@ -402,12 +405,9 @@ int kallsyms__parse(const char *filename
 		char *symbol_name;
 
 		line_len = getline(&line, &n, file);
-		if (line_len < 0)
+		if (line_len < 0 || !line)
 			break;
 
-		if (!line)
-			goto out_failure;
-
 		line[--line_len] = '\0'; /* \n */
 
 		len = hex2u64(line, &start);
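The simplified loop in `kallsyms__parse()` does the same per-line work as before: strip the trailing newline, then split `"<hex addr> <type> <name>"`, with `hex2u64()` scanning the address. A sketch of that parsing using plain `sscanf`; `parse_kallsyms_line()` and `struct ksym` are illustrative, not perf's internal types:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct ksym {
	unsigned long long	addr;
	char			type;	/* 'T', 't', 'D', ... as in nm(1) */
	char			name[128];
};

/* Parse one /proc/kallsyms line in place; returns 0 on success. */
static int parse_kallsyms_line(char *line, struct ksym *out)
{
	size_t len = strlen(line);

	if (len && line[len - 1] == '\n')
		line[--len] = '\0';	/* matches line[--line_len] = '\0' */
	if (sscanf(line, "%llx %c %127s", &out->addr, &out->type,
		   out->name) != 3)
		return -1;
	return 0;
}
```

For a guest kernel the same parser runs against `<root_dir>/proc/kallsyms`, which is why the error message in `dso__split_kallsyms()` now prefixes the path with `root_dir`.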
@@ -459,6 +459,7 @@ static int map__process_kallsym_symbol(v
 	 * map__split_kallsyms, when we have split the maps per module
 	 */
 	symbols__insert(root, sym);
+
 	return 0;
 }
 
@@ -489,6 +490,7 @@ static int dso__split_kallsyms(struct ds
 	struct rb_root *root = &self->symbols[map->type];
 	struct rb_node *next = rb_first(root);
 	int kernel_range = 0;
+	const char *root_dir;
 
 	while (next) {
 		char *module;
@@ -504,15 +506,32 @@ static int dso__split_kallsyms(struct ds
 			*module++ = '\0';
 
 			if (strcmp(curr_map->dso->short_name, module)) {
+				if (curr_map != map &&
+					self->kernel == DSO_TYPE_GUEST_KERNEL &&
+					is_default_guest(kmaps->this_kerninfo)) {
+					/*
+					 * We assume all symbols of a module are contiguous in
+					 * kallsyms, so curr_map points to a module and all its
+					 * symbols are in its kmap. Mark it as loaded.
+					 */
+					dso__set_loaded(curr_map->dso, curr_map->type);
+				}
+
 				curr_map = map_groups__find_by_name(kmaps, map->type, module);
 				if (curr_map == NULL) {
-					pr_debug("/proc/{kallsyms,modules} "
+					if (kmaps->this_kerninfo)
+						root_dir = kmaps->this_kerninfo->root_dir;
+					else
+						root_dir = "";
+					pr_debug("%s/proc/{kallsyms,modules} "
 					         "inconsistency while looking "
-						 "for \"%s\" module!\n", module);
+						 "for \"%s\" module!\n",
+						 root_dir, module);
 					return -1;
 				}
 
-				if (curr_map->dso->loaded)
+				if (curr_map->dso->loaded &&
+					!is_default_guest(kmaps->this_kerninfo))
 					goto discard_symbol;
 			}
 			/*
@@ -525,13 +544,21 @@ static int dso__split_kallsyms(struct ds
 			char dso_name[PATH_MAX];
 			struct dso *dso;
 
-			snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
-				 kernel_range++);
+			if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+				snprintf(dso_name, sizeof(dso_name),
+					"[guest.kernel].%d",
+					kernel_range++);
+			else
+				snprintf(dso_name, sizeof(dso_name),
+					"[kernel].%d",
+					kernel_range++);
 
 			dso = dso__new(dso_name);
 			if (dso == NULL)
 				return -1;
 
+			dso->kernel = self->kernel;
+
 			curr_map = map__new2(pos->start, dso, map->type);
 			if (curr_map == NULL) {
 				dso__delete(dso);
@@ -555,6 +582,12 @@ discard_symbol:		rb_erase(&pos->rb_node,
 		}
 	}
 
+	if (curr_map != map &&
+	    self->kernel == DSO_TYPE_GUEST_KERNEL &&
+	    is_default_guest(kmaps->this_kerninfo)) {
+		dso__set_loaded(curr_map->dso, curr_map->type);
+	}
+
 	return count;
 }
 
@@ -565,7 +598,10 @@ int dso__load_kallsyms(struct dso *self,
 		return -1;
 
 	symbols__fixup_end(&self->symbols[map->type]);
-	self->origin = DSO__ORIG_KERNEL;
+	if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+		self->origin = DSO__ORIG_GUEST_KERNEL;
+	else
+		self->origin = DSO__ORIG_KERNEL;
 
 	return dso__split_kallsyms(self, map, filter);
 }
@@ -952,7 +988,7 @@ static int dso__load_sym(struct dso *sel
 	nr_syms = shdr.sh_size / shdr.sh_entsize;
 
 	memset(&sym, 0, sizeof(sym));
-	if (!self->kernel) {
+	if (self->kernel == DSO_TYPE_USER) {
 		self->adjust_symbols = (ehdr.e_type == ET_EXEC ||
 				elf_section_by_name(elf, &ehdr, &shdr,
 						     ".gnu.prelink_undo",
@@ -984,7 +1020,7 @@ static int dso__load_sym(struct dso *sel
 
 		section_name = elf_sec__name(&shdr, secstrs);
 
-		if (self->kernel || kmodule) {
+		if (self->kernel != DSO_TYPE_USER || kmodule) {
 			char dso_name[PATH_MAX];
 
 			if (strcmp(section_name,
@@ -1011,6 +1047,7 @@ static int dso__load_sym(struct dso *sel
 				curr_dso = dso__new(dso_name);
 				if (curr_dso == NULL)
 					goto out_elf_end;
+				curr_dso->kernel = self->kernel;
 				curr_map = map__new2(start, curr_dso,
 						     map->type);
 				if (curr_map == NULL) {
@@ -1021,7 +1058,7 @@ static int dso__load_sym(struct dso *sel
 				curr_map->unmap_ip = identity__map_ip;
 				curr_dso->origin = self->origin;
 				map_groups__insert(kmap->kmaps, curr_map);
-				dsos__add(&dsos__kernel, curr_dso);
+				dsos__add(&self->node, curr_dso);
 				dso__set_loaded(curr_dso, map->type);
 			} else
 				curr_dso = curr_map->dso;
@@ -1083,7 +1120,7 @@ static bool dso__build_id_equal(const st
 	return memcmp(self->build_id, build_id, sizeof(self->build_id)) == 0;
 }
 
-static bool __dsos__read_build_ids(struct list_head *head, bool with_hits)
+bool __dsos__read_build_ids(struct list_head *head, bool with_hits)
 {
 	bool have_build_id = false;
 	struct dso *pos;
@@ -1101,13 +1138,6 @@ static bool __dsos__read_build_ids(struc
 	return have_build_id;
 }
 
-bool dsos__read_build_ids(bool with_hits)
-{
-	bool kbuildids = __dsos__read_build_ids(&dsos__kernel, with_hits),
-	     ubuildids = __dsos__read_build_ids(&dsos__user, with_hits);
-	return kbuildids || ubuildids;
-}
-
 /*
  * Align offset to 4 bytes as needed for note name and descriptor data.
  */
@@ -1242,6 +1272,8 @@ char dso__symtab_origin(const struct dso
 		[DSO__ORIG_BUILDID] =  'b',
 		[DSO__ORIG_DSO] =      'd',
 		[DSO__ORIG_KMODULE] =  'K',
+		[DSO__ORIG_GUEST_KERNEL] =  'g',
+		[DSO__ORIG_GUEST_KMODULE] =  'G',
 	};
 
 	if (self == NULL || self->origin == DSO__ORIG_NOT_FOUND)
@@ -1257,11 +1289,20 @@ int dso__load(struct dso *self, struct m
 	char build_id_hex[BUILD_ID_SIZE * 2 + 1];
 	int ret = -1;
 	int fd;
+	struct kernel_info *kerninfo;
+	const char *root_dir;
 
 	dso__set_loaded(self, map->type);
 
-	if (self->kernel)
+	if (self->kernel == DSO_TYPE_KERNEL)
 		return dso__load_kernel_sym(self, map, filter);
+	else if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+		return dso__load_guest_kernel_sym(self, map, filter);
+
+	if (map->groups && map->groups->this_kerninfo)
+		kerninfo = map->groups->this_kerninfo;
+	else
+		kerninfo = NULL;
 
 	name = malloc(size);
 	if (!name)
@@ -1315,6 +1356,13 @@ more:
 		case DSO__ORIG_DSO:
 			snprintf(name, size, "%s", self->long_name);
 			break;
+		case DSO__ORIG_GUEST_KMODULE:
+			if (map->groups && map->groups->this_kerninfo)
+				root_dir = map->groups->this_kerninfo->root_dir;
+			else
+				root_dir = "";
+			snprintf(name, size, "%s%s", root_dir, self->long_name);
+			break;
 
 		default:
 			goto out;
@@ -1368,7 +1416,8 @@ struct map *map_groups__find_by_name(str
 	return NULL;
 }
 
-static int dso__kernel_module_get_build_id(struct dso *self)
+static int dso__kernel_module_get_build_id(struct dso *self,
+				const char *root_dir)
 {
 	char filename[PATH_MAX];
 	/*
@@ -1378,8 +1427,8 @@ static int dso__kernel_module_get_build_
 	const char *name = self->short_name + 1;
 
 	snprintf(filename, sizeof(filename),
-		 "/sys/module/%.*s/notes/.note.gnu.build-id",
-		 (int)strlen(name - 1), name);
+		 "%s/sys/module/%.*s/notes/.note.gnu.build-id",
+		 root_dir, (int)strlen(name) - 1, name);
 
 	if (sysfs__read_build_id(filename, self->build_id,
 				 sizeof(self->build_id)) == 0)
@@ -1388,7 +1437,8 @@ static int dso__kernel_module_get_build_
 	return 0;
 }
 
-static int map_groups__set_modules_path_dir(struct map_groups *self, char *dir_name)
+static int map_groups__set_modules_path_dir(struct map_groups *self,
+				const char *dir_name)
 {
 	struct dirent *dent;
 	DIR *dir = opendir(dir_name);
@@ -1400,8 +1450,14 @@ static int map_groups__set_modules_path_
 
 	while ((dent = readdir(dir)) != NULL) {
 		char path[PATH_MAX];
+		struct stat st;
+
+		/* sshfs might return a bad dent->d_type, so we have to stat() */
+		sprintf(path, "%s/%s", dir_name, dent->d_name);
+		if (stat(path, &st))
+			continue;
 
-		if (dent->d_type == DT_DIR) {
+		if (S_ISDIR(st.st_mode)) {
 			if (!strcmp(dent->d_name, ".") ||
 			    !strcmp(dent->d_name, ".."))
 				continue;
@@ -1433,7 +1489,7 @@ static int map_groups__set_modules_path_
 			if (long_name == NULL)
 				goto failure;
 			dso__set_long_name(map->dso, long_name);
-			dso__kernel_module_get_build_id(map->dso);
+			dso__kernel_module_get_build_id(map->dso, "");
 		}
 	}
 
@@ -1443,16 +1499,46 @@ failure:
 	return -1;
 }
 
-static int map_groups__set_modules_path(struct map_groups *self)
+static char *get_kernel_version(const char *root_dir)
 {
-	struct utsname uts;
+	char version[PATH_MAX];
+	FILE *file;
+	char *name, *tmp;
+	const char *prefix = "Linux version ";
+
+	sprintf(version, "%s/proc/version", root_dir);
+	file = fopen(version, "r");
+	if (!file)
+		return NULL;
+
+	version[0] = '\0';
+	tmp = fgets(version, sizeof(version), file);
+	fclose(file);
+
+	name = strstr(version, prefix);
+	if (!name)
+		return NULL;
+	name += strlen(prefix);
+	tmp = strchr(name, ' ');
+	if (tmp)
+		*tmp = '\0';
+
+	return strdup(name);
+}
+
+static int map_groups__set_modules_path(struct map_groups *self,
+				const char *root_dir)
+{
+	char *version;
 	char modules_path[PATH_MAX];
 
-	if (uname(&uts) < 0)
+	version = get_kernel_version(root_dir);
+	if (!version)
 		return -1;
 
-	snprintf(modules_path, sizeof(modules_path), "/lib/modules/%s/kernel",
-		 uts.release);
+	snprintf(modules_path, sizeof(modules_path), "%s/lib/modules/%s/kernel",
+		 root_dir, version);
+	free(version);
 
 	return map_groups__set_modules_path_dir(self, modules_path);
 }
@@ -1477,11 +1563,13 @@ static struct map *map__new2(u64 start, 
 }
 
 struct map *map_groups__new_module(struct map_groups *self, u64 start,
-				   const char *filename)
+				const char *filename,
+				struct kernel_info *kerninfo)
 {
 	struct map *map;
-	struct dso *dso = __dsos__findnew(&dsos__kernel, filename);
+	struct dso *dso;
 
+	dso = __dsos__findnew(&kerninfo->dsos__kernel, filename);
 	if (dso == NULL)
 		return NULL;
 
@@ -1489,21 +1577,37 @@ struct map *map_groups__new_module(struc
 	if (map == NULL)
 		return NULL;
 
-	dso->origin = DSO__ORIG_KMODULE;
+	if (is_host_kernel(kerninfo))
+		dso->origin = DSO__ORIG_KMODULE;
+	else
+		dso->origin = DSO__ORIG_GUEST_KMODULE;
 	map_groups__insert(self, map);
 	return map;
 }
 
-static int map_groups__create_modules(struct map_groups *self)
+static int map_groups__create_modules(struct kernel_info *kerninfo)
 {
 	char *line = NULL;
 	size_t n;
-	FILE *file = fopen("/proc/modules", "r");
+	FILE *file;
 	struct map *map;
+	const char *root_dir;
+	const char *modules;
+	char path[PATH_MAX];
+
+	if (is_default_guest(kerninfo))
+		modules = symbol_conf.default_guest_modules;
+	else {
+		sprintf(path, "%s/proc/modules", kerninfo->root_dir);
+		modules = path;
+	}
 
+	file = fopen(modules, "r");
 	if (file == NULL)
 		return -1;
 
+	root_dir = kerninfo->root_dir;
+
 	while (!feof(file)) {
 		char name[PATH_MAX];
 		u64 start;
@@ -1532,16 +1636,17 @@ static int map_groups__create_modules(st
 		*sep = '\0';
 
 		snprintf(name, sizeof(name), "[%s]", line);
-		map = map_groups__new_module(self, start, name);
+		map = map_groups__new_module(&kerninfo->kmaps,
+				start, name, kerninfo);
 		if (map == NULL)
 			goto out_delete_line;
-		dso__kernel_module_get_build_id(map->dso);
+		dso__kernel_module_get_build_id(map->dso, root_dir);
 	}
 
 	free(line);
 	fclose(file);
 
-	return map_groups__set_modules_path(self);
+	return map_groups__set_modules_path(&kerninfo->kmaps, root_dir);
 
 out_delete_line:
 	free(line);
@@ -1708,8 +1813,54 @@ out_fixup:
 	return err;
 }
 
-LIST_HEAD(dsos__user);
-LIST_HEAD(dsos__kernel);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+				symbol_filter_t filter)
+{
+	int err;
+	const char *kallsyms_filename = NULL;
+	struct kernel_info *kerninfo;
+	char path[PATH_MAX];
+
+	if (!map->groups) {
+		pr_debug("Guest kernel map doesn't point back to its map_groups\n");
+		return -1;
+	}
+	kerninfo = map->groups->this_kerninfo;
+
+	if (is_default_guest(kerninfo)) {
+		/*
+		 * If the user specified a vmlinux filename, use it and only
+		 * it, reporting errors to the user if it cannot be used;
+		 * otherwise use the guest kallsyms file given on the command line.
+		 */
+		if (symbol_conf.default_guest_vmlinux_name != NULL) {
+			err = dso__load_vmlinux(self, map,
+				symbol_conf.default_guest_vmlinux_name, filter);
+			goto out_try_fixup;
+		}
+
+		kallsyms_filename = symbol_conf.default_guest_kallsyms;
+		if (!kallsyms_filename)
+			return -1;
+	} else {
+		sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+		kallsyms_filename = path;
+	}
+
+	err = dso__load_kallsyms(self, kallsyms_filename, map, filter);
+	if (err > 0)
+		pr_debug("Using %s for symbols\n", kallsyms_filename);
+
+out_try_fixup:
+	if (err > 0) {
+		if (kallsyms_filename != NULL)
+			dso__set_long_name(self, strdup("[guest.kernel.kallsyms]"));
+		map__fixup_start(map);
+		map__fixup_end(map);
+	}
+
+	return err;
+}
 
 static void dsos__add(struct list_head *head, struct dso *dso)
 {
@@ -1752,10 +1903,16 @@ static void __dsos__fprintf(struct list_
 	}
 }
 
-void dsos__fprintf(FILE *fp)
+void dsos__fprintf(struct rb_root *kerninfo_root, FILE *fp)
 {
-	__dsos__fprintf(&dsos__kernel, fp);
-	__dsos__fprintf(&dsos__user, fp);
+	struct rb_node *nd;
+
+	for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+				rb_node);
+		__dsos__fprintf(&pos->dsos__kernel, fp);
+		__dsos__fprintf(&pos->dsos__user, fp);
+	}
 }
 
 static size_t __dsos__fprintf_buildid(struct list_head *head, FILE *fp,
@@ -1773,10 +1930,21 @@ static size_t __dsos__fprintf_buildid(st
 	return ret;
 }
 
-size_t dsos__fprintf_buildid(FILE *fp, bool with_hits)
+size_t dsos__fprintf_buildid(struct rb_root *kerninfo_root,
+		FILE *fp, bool with_hits)
 {
-	return (__dsos__fprintf_buildid(&dsos__kernel, fp, with_hits) +
-		__dsos__fprintf_buildid(&dsos__user, fp, with_hits));
+	struct rb_node *nd;
+	size_t ret = 0;
+
+	for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+		struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+				rb_node);
+		ret += __dsos__fprintf_buildid(&pos->dsos__kernel,
+					fp, with_hits);
+		ret += __dsos__fprintf_buildid(&pos->dsos__user,
+					fp, with_hits);
+	}
+	return ret;
 }
 
 struct dso *dso__new_kernel(const char *name)
@@ -1785,28 +1953,55 @@ struct dso *dso__new_kernel(const char *
 
 	if (self != NULL) {
 		dso__set_short_name(self, "[kernel]");
-		self->kernel	 = 1;
+		self->kernel = DSO_TYPE_KERNEL;
+	}
+
+	return self;
+}
+
+struct dso *dso__new_guest_kernel(const char *name)
+{
+	struct dso *self = dso__new(name ?: "[guest.kernel.kallsyms]");
+
+	if (self != NULL) {
+		dso__set_short_name(self, "[guest.kernel]");
+		self->kernel = DSO_TYPE_GUEST_KERNEL;
 	}
 
 	return self;
 }
 
-void dso__read_running_kernel_build_id(struct dso *self)
+void dso__read_running_kernel_build_id(struct dso *self,
+			struct kernel_info *kerninfo)
 {
-	if (sysfs__read_build_id("/sys/kernel/notes", self->build_id,
+	char path[PATH_MAX];
+
+	if (is_default_guest(kerninfo))
+		return;
+	sprintf(path, "%s/sys/kernel/notes", kerninfo->root_dir);
+	if (sysfs__read_build_id(path, self->build_id,
 				 sizeof(self->build_id)) == 0)
 		self->has_build_id = true;
 }
 
-static struct dso *dsos__create_kernel(const char *vmlinux)
+static struct dso *dsos__create_kernel(struct kernel_info *kerninfo)
 {
-	struct dso *kernel = dso__new_kernel(vmlinux);
+	const char *vmlinux_name = NULL;
+	struct dso *kernel;
 
-	if (kernel != NULL) {
-		dso__read_running_kernel_build_id(kernel);
-		dsos__add(&dsos__kernel, kernel);
+	if (is_host_kernel(kerninfo)) {
+		vmlinux_name = symbol_conf.vmlinux_name;
+		kernel = dso__new_kernel(vmlinux_name);
+	} else {
+		if (is_default_guest(kerninfo))
+			vmlinux_name = symbol_conf.default_guest_vmlinux_name;
+		kernel = dso__new_guest_kernel(vmlinux_name);
 	}
 
+	if (kernel != NULL) {
+		dso__read_running_kernel_build_id(kernel, kerninfo);
+		dsos__add(&kerninfo->dsos__kernel, kernel);
+	}
 	return kernel;
 }
 
@@ -1950,23 +2145,29 @@ out_free_comm_list:
 	return -1;
 }
 
-int map_groups__create_kernel_maps(struct map_groups *self,
-				   struct map *vmlinux_maps[MAP__NR_TYPES])
+int map_groups__create_kernel_maps(struct rb_root *kerninfo_root, pid_t pid)
 {
-	struct dso *kernel = dsos__create_kernel(symbol_conf.vmlinux_name);
+	struct kernel_info *kerninfo;
+	struct dso *kernel;
 
+	kerninfo = kerninfo__findnew(kerninfo_root, pid);
+	if (kerninfo == NULL)
+		return -1;
+	kernel = dsos__create_kernel(kerninfo);
 	if (kernel == NULL)
 		return -1;
 
-	if (__map_groups__create_kernel_maps(self, vmlinux_maps, kernel) < 0)
+	if (__map_groups__create_kernel_maps(&kerninfo->kmaps,
+			kerninfo->vmlinux_maps, kernel) < 0)
 		return -1;
 
-	if (symbol_conf.use_modules && map_groups__create_modules(self) < 0)
+	if (symbol_conf.use_modules &&
+		map_groups__create_modules(kerninfo) < 0)
 		pr_debug("Problems creating module maps, continuing anyway...\n");
 	/*
 	 * Now that we have all the maps created, just set the ->end of them:
 	 */
-	map_groups__fixup_end(self);
+	map_groups__fixup_end(&kerninfo->kmaps);
 	return 0;
 }
 
@@ -2012,3 +2213,47 @@ char *strxfrchar(char *s, char from, cha
 
 	return s;
 }
+
+int map_groups__create_guest_kernel_maps(struct rb_root *kerninfo_root)
+{
+	int ret = 0;
+	struct dirent **namelist = NULL;
+	int i, items = 0;
+	char path[PATH_MAX];
+	pid_t pid;
+
+	if (symbol_conf.default_guest_vmlinux_name ||
+	    symbol_conf.default_guest_modules ||
+	    symbol_conf.default_guest_kallsyms) {
+		map_groups__create_kernel_maps(kerninfo_root,
+					DEFAULT_GUEST_KERNEL_ID);
+	}
+
+	if (symbol_conf.guestmount) {
+		items = scandir(symbol_conf.guestmount, &namelist, NULL, NULL);
+		if (items <= 0)
+			return -ENOENT;
+		for (i = 0; i < items; i++) {
+			if (!isdigit(namelist[i]->d_name[0])) {
+				/* Filter out non-pid entries such as "." and ".." */
+				continue;
+			}
+			pid = atoi(namelist[i]->d_name);
+			sprintf(path, "%s/%s/proc/kallsyms",
+				symbol_conf.guestmount,
+				namelist[i]->d_name);
+			ret = access(path, R_OK);
+			if (ret) {
+				pr_debug("Can't access file %s\n", path);
+				goto failure;
+			}
+			map_groups__create_kernel_maps(kerninfo_root,
+							pid);
+		}
+failure:
+		free(namelist);
+	}
+
+	return ret;
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/util/symbol.h linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.h
--- linux-2.6_tip0413/tools/perf/util/symbol.h	2010-04-14 11:11:58.766255670 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.h	2010-04-14 11:13:17.321860837 +0800
@@ -69,10 +69,15 @@ struct symbol_conf {
 			show_nr_samples,
 			use_callchain,
 			exclude_other,
-			full_paths;
+			full_paths,
+			show_cpu_utilization;
 	const char	*vmlinux_name,
 			*field_sep;
-	char            *dso_list_str,
+	const char	*default_guest_vmlinux_name,
+			*default_guest_kallsyms,
+			*default_guest_modules;
+	const char	*guestmount;
+	char		*dso_list_str,
 			*comm_list_str,
 			*sym_list_str,
 			*col_width_list_str;
@@ -106,6 +111,13 @@ struct addr_location {
 	u64	      addr;
 	char	      level;
 	bool	      filtered;
+	unsigned int  cpumode;
+};
+
+enum dso_kernel_type {
+	DSO_TYPE_USER = 0,
+	DSO_TYPE_KERNEL,
+	DSO_TYPE_GUEST_KERNEL
 };
 
 struct dso {
@@ -115,7 +127,7 @@ struct dso {
 	u8		 adjust_symbols:1;
 	u8		 slen_calculated:1;
 	u8		 has_build_id:1;
-	u8		 kernel:1;
+	enum dso_kernel_type	kernel;
 	u8		 hit:1;
 	u8		 annotate_warned:1;
 	unsigned char	 origin;
@@ -131,6 +143,7 @@ struct dso {
 
 struct dso *dso__new(const char *name);
 struct dso *dso__new_kernel(const char *name);
+struct dso *dso__new_guest_kernel(const char *name);
 void dso__delete(struct dso *self);
 
 bool dso__loaded(const struct dso *self, enum map_type type);
@@ -143,34 +156,30 @@ static inline void dso__set_loaded(struc
 
 void dso__sort_by_name(struct dso *self, enum map_type type);
 
-extern struct list_head dsos__user, dsos__kernel;
-
 struct dso *__dsos__findnew(struct list_head *head, const char *name);
 
-static inline struct dso *dsos__findnew(const char *name)
-{
-	return __dsos__findnew(&dsos__user, name);
-}
-
 int dso__load(struct dso *self, struct map *map, symbol_filter_t filter);
 int dso__load_vmlinux_path(struct dso *self, struct map *map,
 			   symbol_filter_t filter);
 int dso__load_kallsyms(struct dso *self, const char *filename, struct map *map,
 		       symbol_filter_t filter);
-void dsos__fprintf(FILE *fp);
-size_t dsos__fprintf_buildid(FILE *fp, bool with_hits);
+void dsos__fprintf(struct rb_root *kerninfo_root, FILE *fp);
+size_t dsos__fprintf_buildid(struct rb_root *kerninfo_root,
+		FILE *fp, bool with_hits);
 
 size_t dso__fprintf_buildid(struct dso *self, FILE *fp);
 size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp);
 
 enum dso_origin {
 	DSO__ORIG_KERNEL = 0,
+	DSO__ORIG_GUEST_KERNEL,
 	DSO__ORIG_JAVA_JIT,
 	DSO__ORIG_BUILD_ID_CACHE,
 	DSO__ORIG_FEDORA,
 	DSO__ORIG_UBUNTU,
 	DSO__ORIG_BUILDID,
 	DSO__ORIG_DSO,
+	DSO__ORIG_GUEST_KMODULE,
 	DSO__ORIG_KMODULE,
 	DSO__ORIG_NOT_FOUND,
 };
@@ -178,19 +187,26 @@ enum dso_origin {
 char dso__symtab_origin(const struct dso *self);
 void dso__set_long_name(struct dso *self, char *name);
 void dso__set_build_id(struct dso *self, void *build_id);
-void dso__read_running_kernel_build_id(struct dso *self);
+void dso__read_running_kernel_build_id(struct dso *self,
+		struct kernel_info *kerninfo);
 struct symbol *dso__find_symbol(struct dso *self, enum map_type type, u64 addr);
 struct symbol *dso__find_symbol_by_name(struct dso *self, enum map_type type,
 					const char *name);
 
 int filename__read_build_id(const char *filename, void *bf, size_t size);
 int sysfs__read_build_id(const char *filename, void *bf, size_t size);
-bool dsos__read_build_ids(bool with_hits);
+bool __dsos__read_build_ids(struct list_head *head, bool with_hits);
 int build_id__sprintf(const u8 *self, int len, char *bf);
 int kallsyms__parse(const char *filename, void *arg,
 		    int (*process_symbol)(void *arg, const char *name,
 					  char type, u64 start));
 
+int __map_groups__create_kernel_maps(struct map_groups *self,
+			struct map *vmlinux_maps[MAP__NR_TYPES],
+			struct dso *kernel);
+int map_groups__create_kernel_maps(struct rb_root *kerninfo_root, pid_t pid);
+int map_groups__create_guest_kernel_maps(struct rb_root *kerninfo_root);
+
 int symbol__init(void);
 bool symbol_type__is_a(char symbol_type, enum map_type map_type);
 
diff -Nraup linux-2.6_tip0413/tools/perf/util/thread.h linux-2.6_tip0413_perfkvm/tools/perf/util/thread.h
--- linux-2.6_tip0413/tools/perf/util/thread.h	2010-04-14 11:11:58.594236160 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/thread.h	2010-04-14 11:13:17.321860837 +0800
@@ -33,12 +33,12 @@ static inline struct map *thread__find_m
 
 void thread__find_addr_map(struct thread *self,
 			   struct perf_session *session, u8 cpumode,
-			   enum map_type type, u64 addr,
+			   enum map_type type, pid_t pid, u64 addr,
 			   struct addr_location *al);
 
 void thread__find_addr_location(struct thread *self,
 				struct perf_session *session, u8 cpumode,
-				enum map_type type, u64 addr,
+				enum map_type type, pid_t pid, u64 addr,
 				struct addr_location *al,
 				symbol_filter_t filter);
 #endif	/* __PERF_THREAD_H */



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-14  9:20 ` Avi Kivity
  2010-04-14  9:43   ` Sheng Yang
  2010-04-14 10:43   ` Ingo Molnar
@ 2010-04-15  1:04   ` Zhang, Yanmin
  2010-04-15  8:05     ` Avi Kivity
  2 siblings, 1 reply; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-15  1:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Wed, 2010-04-14 at 12:20 +0300, Avi Kivity wrote:
> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> > Here is the new patch of V3 against tip/master of April 13th
> > if anyone wants to try it.
> >
> >    
> 
> Thanks for persisting despite the flames.
> 
> Can you please separate arch/x86/kvm part of the patch?  That will make 
> for easier reviewing, and will need to go through separate trees.
Definitely; I will do so in the next version, which also fixes
some issues pointed out by Ingo.

> 
> Sheng, did you make any progress with the NMI injection issue?
> 
> > +
> > diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
> > --- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
> > +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-14 11:32:45.841278890 +0800
> > @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
> >   	}
> >   }
> >
> > +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> > +
> > +static int kvm_is_in_guest(void)
> > +{
> > +	return percpu_read(current_vcpu) != NULL;
> >    
> 
> An even more accurate way to determine this is to check whether the 
> interrupt frame points back at the 'int $2' instruction.  However we 
> plan to switch to a self-IPI method to inject the NMI, and I'm not sure 
> whether APIC NMIs are accepted on an instruction boundary or whether 
> there's some latency involved.
Yes, but the frame-pointer check seems a little complicated.

> 
> > +static unsigned long kvm_get_guest_ip(void)
> > +{
> > +	unsigned long ip = 0;
> > +	if (percpu_read(current_vcpu))
> > +		ip = kvm_rip_read(percpu_read(current_vcpu));
> > +	return ip;
> > +}
> >    
> 
> This may be racy.  kvm_rip_read() accesses a cache in memory; if we're 
> in the process of updating the cache, then we may read a stale value.  
> See below.
Right. The race window seems too big.

> 
> >
> >   	trace_kvm_entry(vcpu->vcpu_id);
> > +
> > +	percpu_write(current_vcpu, vcpu);
> >   	kvm_x86_ops->run(vcpu);
> > +	percpu_write(current_vcpu, NULL);
> >    
> 
> If you move this around the 'int $2' instructions you will close the 
> race, as a stray NMI won't catch us updating the rip cache.  But that 
> depends on whether self-IPI is accepted on the next instruction or not.
Right. The kernel part depends on the self-IPI implementation.
I will move the percpu_write(current_vcpu, vcpu) above (or a new wrapper
function) to just around the 'int $2'.

Sheng will look for a solution for the self-IPI delivery. Let's treat my patch
and self-IPI as two separate issues, since we don't know when the self-IPI
delivery will be resolved.

Thanks,
Yanmin



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side
  2010-04-15  8:05     ` Avi Kivity
@ 2010-04-15  8:57       ` Zhang, Yanmin
  2010-04-15  9:04         ` Joerg Roedel
  2010-04-27 19:03         ` [PATCH] Psychovisually-optimized HZ setting (2.6.33.3) Uwaysi Bin Kareem
  0 siblings, 2 replies; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-15  8:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Peter Zijlstra, Sheng Yang, linux-kernel, kvm,
	Marcelo Tosatti, Joerg Roedel, Jes Sorensen, Gleb Natapov,
	Zachary Amsden, zhiteng.huang, tim.c.chen,
	Arnaldo Carvalho de Melo

On Thu, 2010-04-15 at 11:05 +0300, Avi Kivity wrote:
> On 04/15/2010 04:04 AM, Zhang, Yanmin wrote:
> >
> >> An even more accurate way to determine this is to check whether the
> >> interrupt frame points back at the 'int $2' instruction.  However we
> >> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
> >> whether APIC NMIs are accepted on an instruction boundary or whether
> >> there's some latency involved.
> >>      
> > Yes, but the frame-pointer check seems a little complicated.
> >    
> 
> An even bigger disadvantage is that it won't work with Sheng's patch:
> self-NMIs are not synchronous.
> 
> >>>    	trace_kvm_entry(vcpu->vcpu_id);
> >>> +
> >>> +	percpu_write(current_vcpu, vcpu);
> >>>    	kvm_x86_ops->run(vcpu);
> >>> +	percpu_write(current_vcpu, NULL);
> >>>
> >>>        
> >> If you move this around the 'int $2' instructions you will close the
> >> race, as a stray NMI won't catch us updating the rip cache.  But that
> >> depends on whether self-IPI is accepted on the next instruction or not.
> >>      
> > Right. The kernel part depends on the self-IPI implementation.
> > I will move the percpu_write(current_vcpu, vcpu) above (or a new wrapper
> > function) to just around the 'int $2'.
> >
> >    
> 
> Or create a new function to inject the interrupt in x86.c.  That will 
> reduce duplication between svm.c and vmx.c.
I checked svm.c, and it seems svm.c doesn't forward an NMI to the host when
the NMI happens in the guest OS. In addition, svm_complete_interrupts is
called after interrupts are enabled.



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-04-27 21:50 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-14  9:05 [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side Zhang, Yanmin
2010-04-14  9:20 ` Avi Kivity
2010-04-14  9:43   ` Sheng Yang
2010-04-14  9:57     ` Avi Kivity
2010-04-14 10:14       ` Sheng Yang
2010-04-14 10:19         ` Avi Kivity
2010-04-14 10:27           ` Sheng Yang
2010-04-14 10:33             ` Avi Kivity
2010-04-14 10:36               ` Sheng Yang
2010-04-14 10:43   ` Ingo Molnar
2010-04-14 11:17     ` Avi Kivity
2010-04-15  1:04   ` Zhang, Yanmin
2010-04-15  8:05     ` Avi Kivity
2010-04-15  8:57       ` Zhang, Yanmin
2010-04-15  9:04         ` Joerg Roedel
2010-04-15  9:09           ` Avi Kivity
2010-04-15  9:44             ` Joerg Roedel
2010-04-15  9:48               ` Avi Kivity
2010-04-15 10:40                 ` Joerg Roedel
2010-04-15 10:44                   ` Avi Kivity
2010-04-15 14:08                     ` Sheng Yang
2010-04-17 18:12                       ` Avi Kivity
2010-04-19  8:25                         ` Avi Kivity
2010-04-20  3:32                           ` Sheng Yang
2010-04-20  9:38                             ` Avi Kivity
2010-04-27 19:03         ` [PATCH] Psychovisually-optimized HZ setting (2.6.33.3) Uwaysi Bin Kareem
2010-04-27 19:51           ` Randy Dunlap
2010-04-27 21:50           ` Valdis.Kletnieks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).