From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christoffer Dall <christoffer.dall@linaro.org>
Subject: Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
Date: Tue, 2 Jun 2015 11:27:59 +0200
Message-ID: <20150602092759.GA7783@cbox>
References: <1432838950-28774-1-git-send-email-christoffer.dall@linaro.org>
 <5568E987.5010702@samsung.com> <20150531065923.GB763@cbox>
 <556C7EC6.3010105@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, Marc Zyngier <marc.zyngier@arm.com>,
 borntraeger@de.ibm.com, Paolo Bonzini <pbonzini@redhat.com>,
 kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org
To: Mario Smarduch <m.smarduch@samsung.com>
Return-path: <kvmarm-bounces@lists.cs.columbia.edu>
Content-Disposition: inline
In-Reply-To: <556C7EC6.3010105@samsung.com>
List-Unsubscribe: <https://lists.cs.columbia.edu/mailman/options/kvmarm>,
 <mailto:kvmarm-request@lists.cs.columbia.edu?subject=unsubscribe>
List-Archive: <https://lists.cs.columbia.edu/pipermail/kvmarm>
List-Post: <mailto:kvmarm@lists.cs.columbia.edu>
List-Help: <mailto:kvmarm-request@lists.cs.columbia.edu?subject=help>
List-Subscribe: <https://lists.cs.columbia.edu/mailman/listinfo/kvmarm>,
 <mailto:kvmarm-request@lists.cs.columbia.edu?subject=subscribe>
Errors-To: kvmarm-bounces@lists.cs.columbia.edu
Sender: kvmarm-bounces@lists.cs.columbia.edu
List-Id: kvm.vger.kernel.org

On Mon, Jun 01, 2015 at 08:48:22AM -0700, Mario Smarduch wrote:
> On 05/30/2015 11:59 PM, Christoffer Dall wrote:
> > Hi Mario,
> > 
> > On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
> >> On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> >>> Until now we have been calling kvm_guest_exit after re-enabling
> >>> interrupts when we come back from the guest, but this has the
> >>> unfortunate effect that CPU time accounting done in the context of timer
> >>> interrupts occurring while the guest is running doesn't properly notice
> >>> that the time since the last tick was spent in the guest.
> >>>
> >>> Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> >>> below the local_irq_enable() call and change __kvm_guest_exit() to
> >>> kvm_guest_exit(), because we are now calling this function with
> >>> interrupts enabled.  We have to now explicitly disable preemption and
> >>> not enable preemption before we've called kvm_guest_exit(), since
> >>> otherwise we could be preempted and everything happening before we
> >>> eventually get scheduled again would be accounted for as guest time.
> >>>
> >>> At the same time, move the trace_kvm_exit() call outside of the atomic
> >>> section, since there is no reason for us to do that with interrupts
> >>> disabled.
> >>>
> >>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>> ---
> >>> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> >>> rework recently posted by Christian Borntraeger.  I hope I got the logic
> >>> of this right, there were 2 slightly worrying facts about this:
> >>>
> >>> First, we now enable and disable and enable interrupts on each exit
> >>> path, but I couldn't see any performance overhead on hackbench - yes the
> >>> only benchmark we care about.
> >>>
> >>> Second, looking at the ppc and mips code, they seem to also call
> >>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>> guest CPU time accounting works on those architectures.
> >>>
> >>> Changes since v1:
> >>>  - Tweak comment and commit text based on Marc's feedback.
> >>>  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> >>>
> >>>  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >>>  1 file changed, 17 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> >>> index e41cb11..fe8028d 100644
> >>> --- a/arch/arm/kvm/arm.c
> >>> +++ b/arch/arm/kvm/arm.c
> >>> @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		kvm_vgic_flush_hwstate(vcpu);
> >>>  		kvm_timer_flush_hwstate(vcpu);
> >>>  
> >>> +		preempt_disable();
> >>>  		local_irq_disable();
> >>>  
> >>>  		/*
> >>> @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  
> >>>  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >>>  			local_irq_enable();
> >>> +			preempt_enable();
> >>>  			kvm_timer_sync_hwstate(vcpu);
> >>>  			kvm_vgic_sync_hwstate(vcpu);
> >>>  			continue;
> >>> @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >>>  
> >>>  		vcpu->mode = OUTSIDE_GUEST_MODE;
> >>> -		__kvm_guest_exit();
> >>> -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		/*
> >>> +		 * Back from guest
> >>> +		 *************************************************************/
> >>> +
> >>>  		/*
> >>>  		 * We may have taken a host interrupt in HYP mode (ie
> >>>  		 * while executing the guest). This interrupt is still
> >>> @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		local_irq_enable();
> >>>  
> >>>  		/*
> >>> -		 * Back from guest
> >>> -		 *************************************************************/
> >>> +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> >>> +		 * that if a timer interrupt hits while running the guest we
> >>> +		 * account that tick as being spent in the guest.  We enable
> >>> +		 * preemption after calling kvm_guest_exit() so that if we get
> >>> +		 * preempted we make sure ticks after that is not counted as
> >>> +		 * guest time.
> >>> +		 */
> >>> +		kvm_guest_exit();
> >>> +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		preempt_enable();
> >>> +
> >>>  
> >>>  		kvm_timer_sync_hwstate(vcpu);
> >>>  		kvm_vgic_sync_hwstate(vcpu);
> >>>
> >>
> >> Hi Christoffer,
> >>  so currently we take a snap shot when we enter the guest
> >> (tsk->vtime_snap) and upon exit add the time we spent in
> >> the guest and update accrued time, which appears correct.
> > 
> > not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
> > am I missing something obvious here?
> I see what you mean we can't use cycle based accounting to accrue
> Guest time.
> 

See other thread, we can enable this in the config but it still only
works with NO_HZ_FULL.

> > 
> >>
> >> With this patch it appears that interrupts running
> >> in host mode are accrued to Guest time, and additional preemption
> >> latency is added.
> >>
> > It is true that interrupt processing in host mode (if they hit on a CPU
> > when it is running a guest) are accrued to guest time, but without this
> > patch on current arm64 we accrue no CPU time to guest time at all, which
> > is hardly more correct.
> Yes if only ticks are supported then it's definitely better!
> 
> Nevertheless with high interrupt rate it will complicate root causing
> issues, a lot of that time will go to guest.

That sounds like a separate fix to me; if there's an existing mechanism
to account for time spent on hw IRQs and it is somehow invalidated by
the PF_VCPU flag being set, then that feels wrong, at least how arm64
works, but it doesn't make this patch less correct.

> 
> > 
> > If this patch is incorrect, then how does it work on x86, where
> > handle_external_intr() is called (with a barrier in between) before
> > kvm_guest_exit(), and where handle_external_intr() is simply
> > local_irq_enable() on SVM and something more complicated on VMX ?
> > 
> > Finally, how exactly is preemption latency added here?   Won't IRQ
> > processing run with higher priority than any task on your system, so the
> > order of (1) process pending IRQs (2) call schedule if needed is still
> > preserved here, but we call kvm_guest_exit() between (1) and (2) instead
> > of before (1).
> 
> I may be missing something, but on return from interrupt with preempt
> disabled we can't take the need resched path. And need to return
> to KVM no?

preempt_enable() will call __preempt_schedule() and cause preemption
there, so you're talking about adding these lines of latency:

	kvm_guest_exit();
	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));

And these were called with interrupts disabled before, so I don't see
the issue??

However, your question is making me think whether we have a race in the
current code on fully preemptible kernels, if we get preempted before
calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(), then we
could potentially schedule another vcpu on this core and loose/corrupt
state, can we not?  We probably need to check for this in
kvm_vcpu_load/kvm_vcpu_put.  I need to think more about if this is a
real issue or if I'm seeing ghosts.

> > 
> > It is entirely possible that I'm missing the mark here and everything
> > gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN or we need some
> > extra logic?
> 
> I think something to look into for us, taking a low issue to high level
> application - for state machine based type of applications (I guess like
> NFV) it cause problems  to root cause issues, a lot of activities
> run between ticks. For VM cycle based accounting is probably a must
> in that case.
> 
Would you run with NO_HZ_FULL in this case?  Because then we should just
enable HAVE_VIRT_CPU_ACCOUNTING_GEN, and I think that would be a good
start.

Thanks,
-Christoffer

From mboxrd@z Thu Jan  1 00:00:00 1970
From: christoffer.dall@linaro.org (Christoffer Dall)
Date: Tue, 2 Jun 2015 11:27:59 +0200
Subject: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
In-Reply-To: <556C7EC6.3010105@samsung.com>
References: <1432838950-28774-1-git-send-email-christoffer.dall@linaro.org>
 <5568E987.5010702@samsung.com> <20150531065923.GB763@cbox>
 <556C7EC6.3010105@samsung.com>
Message-ID: <20150602092759.GA7783@cbox>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Mon, Jun 01, 2015 at 08:48:22AM -0700, Mario Smarduch wrote:
> On 05/30/2015 11:59 PM, Christoffer Dall wrote:
> > Hi Mario,
> > 
> > On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
> >> On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> >>> Until now we have been calling kvm_guest_exit after re-enabling
> >>> interrupts when we come back from the guest, but this has the
> >>> unfortunate effect that CPU time accounting done in the context of timer
> >>> interrupts occurring while the guest is running doesn't properly notice
> >>> that the time since the last tick was spent in the guest.
> >>>
> >>> Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> >>> below the local_irq_enable() call and change __kvm_guest_exit() to
> >>> kvm_guest_exit(), because we are now calling this function with
> >>> interrupts enabled.  We have to now explicitly disable preemption and
> >>> not enable preemption before we've called kvm_guest_exit(), since
> >>> otherwise we could be preempted and everything happening before we
> >>> eventually get scheduled again would be accounted for as guest time.
> >>>
> >>> At the same time, move the trace_kvm_exit() call outside of the atomic
> >>> section, since there is no reason for us to do that with interrupts
> >>> disabled.
> >>>
> >>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>> ---
> >>> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> >>> rework recently posted by Christian Borntraeger.  I hope I got the logic
> >>> of this right, there were 2 slightly worrying facts about this:
> >>>
> >>> First, we now enable and disable and enable interrupts on each exit
> >>> path, but I couldn't see any performance overhead on hackbench - yes the
> >>> only benchmark we care about.
> >>>
> >>> Second, looking at the ppc and mips code, they seem to also call
> >>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>> guest CPU time accounting works on those architectures.
> >>>
> >>> Changes since v1:
> >>>  - Tweak comment and commit text based on Marc's feedback.
> >>>  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> >>>
> >>>  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >>>  1 file changed, 17 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> >>> index e41cb11..fe8028d 100644
> >>> --- a/arch/arm/kvm/arm.c
> >>> +++ b/arch/arm/kvm/arm.c
> >>> @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		kvm_vgic_flush_hwstate(vcpu);
> >>>  		kvm_timer_flush_hwstate(vcpu);
> >>>  
> >>> +		preempt_disable();
> >>>  		local_irq_disable();
> >>>  
> >>>  		/*
> >>> @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  
> >>>  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >>>  			local_irq_enable();
> >>> +			preempt_enable();
> >>>  			kvm_timer_sync_hwstate(vcpu);
> >>>  			kvm_vgic_sync_hwstate(vcpu);
> >>>  			continue;
> >>> @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >>>  
> >>>  		vcpu->mode = OUTSIDE_GUEST_MODE;
> >>> -		__kvm_guest_exit();
> >>> -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		/*
> >>> +		 * Back from guest
> >>> +		 *************************************************************/
> >>> +
> >>>  		/*
> >>>  		 * We may have taken a host interrupt in HYP mode (ie
> >>>  		 * while executing the guest). This interrupt is still
> >>> @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		local_irq_enable();
> >>>  
> >>>  		/*
> >>> -		 * Back from guest
> >>> -		 *************************************************************/
> >>> +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> >>> +		 * that if a timer interrupt hits while running the guest we
> >>> +		 * account that tick as being spent in the guest.  We enable
> >>> +		 * preemption after calling kvm_guest_exit() so that if we get
> >>> +		 * preempted we make sure ticks after that is not counted as
> >>> +		 * guest time.
> >>> +		 */
> >>> +		kvm_guest_exit();
> >>> +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		preempt_enable();
> >>> +
> >>>  
> >>>  		kvm_timer_sync_hwstate(vcpu);
> >>>  		kvm_vgic_sync_hwstate(vcpu);
> >>>
> >>
> >> Hi Christoffer,
> >>  so currently we take a snap shot when we enter the guest
> >> (tsk->vtime_snap) and upon exit add the time we spent in
> >> the guest and update accrued time, which appears correct.
> > 
> > not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
> > am I missing something obvious here?
> I see what you mean we can't use cycle based accounting to accrue
> Guest time.
> 

See other thread, we can enable this in the config but it still only
works with NO_HZ_FULL.

> > 
> >>
> >> With this patch it appears that interrupts running
> >> in host mode are accrued to Guest time, and additional preemption
> >> latency is added.
> >>
> > It is true that interrupt processing in host mode (if they hit on a CPU
> > when it is running a guest) are accrued to guest time, but without this
> > patch on current arm64 we accrue no CPU time to guest time at all, which
> > is hardly more correct.
> Yes if only ticks are supported then it's definitely better!
> 
> Nevertheless with high interrupt rate it will complicate root causing
> issues, a lot of that time will go to guest.

That sounds like a separate fix to me; if there's an existing mechanism
to account for time spent on hw IRQs and it is somehow invalidated by
the PF_VCPU flag being set, then that feels wrong, at least how arm64
works, but it doesn't make this patch less correct.

> 
> > 
> > If this patch is incorrect, then how does it work on x86, where
> > handle_external_intr() is called (with a barrier in between) before
> > kvm_guest_exit(), and where handle_external_intr() is simply
> > local_irq_enable() on SVM and something more complicated on VMX ?
> > 
> > Finally, how exactly is preemption latency added here?   Won't IRQ
> > processing run with higher priority than any task on your system, so the
> > order of (1) process pending IRQs (2) call schedule if needed is still
> > preserved here, but we call kvm_guest_exit() between (1) and (2) instead
> > of before (1).
> 
> I may be missing something, but on return from interrupt with preempt
> disabled we can't take the need resched path. And need to return
> to KVM no?

preempt_enable() will call __preempt_schedule() and cause preemption
there, so you're talking about adding these lines of latency:

	kvm_guest_exit();
	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));

And these were called with interrupts disabled before, so I don't see
the issue??

However, your question is making me think whether we have a race in the
current code on fully preemptible kernels, if we get preempted before
calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(), then we
could potentially schedule another vcpu on this core and loose/corrupt
state, can we not?  We probably need to check for this in
kvm_vcpu_load/kvm_vcpu_put.  I need to think more about if this is a
real issue or if I'm seeing ghosts.

> > 
> > It is entirely possible that I'm missing the mark here and everything
> > gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN or we need some
> > extra logic?
> 
> I think something to look into for us, taking a low issue to high level
> application - for state machine based type of applications (I guess like
> NFV) it cause problems  to root cause issues, a lot of activities
> run between ticks. For VM cycle based accounting is probably a must
> in that case.
> 
Would you run with NO_HZ_FULL in this case?  Because then we should just
enable HAVE_VIRT_CPU_ACCOUNTING_GEN, and I think that would be a good
start.

Thanks,
-Christoffer