[v3,3/9] KVM: x86: Defer tick-based accounting 'til after IRQ handling
diff mbox series

Message ID 20210415222106.1643837-4-seanjc@google.com
State New
Headers show
Series
  • KVM: Fix tick-based accounting for x86 guests
Related show

Commit Message

Sean Christopherson April 15, 2021, 10:21 p.m. UTC
From: Wanpeng Li <wanpengli@tencent.com>

When using tick-based accounting, defer the call to account guest time
until after servicing any IRQ(s) that happened in the guest or
immediately after VM-Exit.  Tick-based accounting of vCPU time relies on
PF_VCPU being set when the tick IRQ handler runs, and IRQs are blocked
throughout {svm,vmx}_vcpu_enter_exit().

This fixes a bug[*] where reported guest time remains '0', even when
running an infinite loop in the guest.

[*] https://bugzilla.kernel.org/show_bug.cgi?id=209831

Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org#v5.9-rc1+
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 13 ++++++++++---
 arch/x86/kvm/vmx/vmx.c | 13 ++++++++++---
 arch/x86/kvm/x86.c     |  8 ++++++++
 3 files changed, 28 insertions(+), 6 deletions(-)

Comments

Frederic Weisbecker April 20, 2021, 11:14 p.m. UTC | #1
On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> When using tick-based accounting, defer the call to account guest time
> until after servicing any IRQ(s) that happened in the guest or
> immediately after VM-Exit.  Tick-based accounting of vCPU time relies on
> PF_VCPU being set when the tick IRQ handler runs, and IRQs are blocked
> throughout {svm,vmx}_vcpu_enter_exit().
> 
> This fixes a bug[*] where reported guest time remains '0', even when
> running an infinite loop in the guest.
> 
> [*] https://bugzilla.kernel.org/show_bug.cgi?id=209831
> 
> Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Michael Tokarev <mjt@tls.msk.ru>
> Cc: stable@vger.kernel.org#v5.9-rc1+
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 16fb39503296..e4d475df1d4a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	local_irq_disable();
>  	kvm_after_interrupt(vcpu);
>  
> +	/*
> +	 * When using tick-based accounting, wait until after servicing IRQs to
> +	 * account guest time so that any ticks that occurred while running the
> +	 * guest are properly accounted to the guest.
> +	 */
> +	if (!vtime_accounting_enabled_this_cpu())
> +		vtime_account_guest_exit();

Can we rather have instead:

static inline void tick_account_guest_exit(void)
{
	if (!vtime_accounting_enabled_this_cpu())
		current->flags &= ~PF_VCPU;
}

It duplicates a bit of code but I think this will read less confusing.

Thanks.

> +
>  	if (lapic_in_kernel(vcpu)) {
>  		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
>  		if (delta != S64_MIN) {
> -- 
> 2.31.1.368.gbe11c130af-goog
>
Sean Christopherson April 20, 2021, 11:26 p.m. UTC | #2
On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 16fb39503296..e4d475df1d4a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	local_irq_disable();
> >  	kvm_after_interrupt(vcpu);
> >  
> > +	/*
> > +	 * When using tick-based accounting, wait until after servicing IRQs to
> > +	 * account guest time so that any ticks that occurred while running the
> > +	 * guest are properly accounted to the guest.
> > +	 */
> > +	if (!vtime_accounting_enabled_this_cpu())
> > +		vtime_account_guest_exit();
> 
> Can we rather have instead:
> 
> static inline void tick_account_guest_exit(void)
> {
> 	if (!vtime_accounting_enabled_this_cpu())
> 		current->flags &= ~PF_VCPU;
> }
> 
> It duplicates a bit of code but I think this will read less confusing.

Either way works for me.  I used vtime_account_guest_exit() to try to keep as
many details as possible inside vtime, e.g. in case the implemenation is tweaked
in the future.  But I agree that pretending KVM isn't already deeply intertwined
with the details is a lie.
Frederic Weisbecker April 21, 2021, 10:07 a.m. UTC | #3
On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> When using tick-based accounting, defer the call to account guest time
> until after servicing any IRQ(s) that happened in the guest or
> immediately after VM-Exit.  Tick-based accounting of vCPU time relies on
> PF_VCPU being set when the tick IRQ handler runs, and IRQs are blocked
> throughout {svm,vmx}_vcpu_enter_exit().

Out of curiosity, how has guest accounting ever managed to work if we have always
cleared PF_VCPU before re-enabling interrupt?

Thanks.
Frederic Weisbecker April 21, 2021, 10:11 a.m. UTC | #4
On Tue, Apr 20, 2021 at 11:26:34PM +0000, Sean Christopherson wrote:
> On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> > On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 16fb39503296..e4d475df1d4a 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >  	local_irq_disable();
> > >  	kvm_after_interrupt(vcpu);
> > >  
> > > +	/*
> > > +	 * When using tick-based accounting, wait until after servicing IRQs to
> > > +	 * account guest time so that any ticks that occurred while running the
> > > +	 * guest are properly accounted to the guest.
> > > +	 */
> > > +	if (!vtime_accounting_enabled_this_cpu())
> > > +		vtime_account_guest_exit();
> > 
> > Can we rather have instead:
> > 
> > static inline void tick_account_guest_exit(void)
> > {
> > 	if (!vtime_accounting_enabled_this_cpu())
> > 		current->flags &= ~PF_VCPU;
> > }
> > 
> > It duplicates a bit of code but I think this will read less confusing.
> 
> Either way works for me.  I used vtime_account_guest_exit() to try to keep as
> many details as possible inside vtime, e.g. in case the implemenation is tweaked
> in the future.  But I agree that pretending KVM isn't already deeply intertwined
> with the details is a lie.

Ok let's keep it as is then. It reads funny but can we perhaps hope that
other archs have the same issue and in the future we'll need to have this split
generalized outside x86 with:

	* guest_exit_vtime()
	* guest_exit_tick()
Frederic Weisbecker April 21, 2021, 12:19 p.m. UTC | #5
On Tue, Apr 20, 2021 at 11:26:34PM +0000, Sean Christopherson wrote:
> On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> > On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 16fb39503296..e4d475df1d4a 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >  	local_irq_disable();
> > >  	kvm_after_interrupt(vcpu);
> > >  
> > > +	/*
> > > +	 * When using tick-based accounting, wait until after servicing IRQs to
> > > +	 * account guest time so that any ticks that occurred while running the
> > > +	 * guest are properly accounted to the guest.
> > > +	 */
> > > +	if (!vtime_accounting_enabled_this_cpu())
> > > +		vtime_account_guest_exit();
> > 
> > Can we rather have instead:
> > 
> > static inline void tick_account_guest_exit(void)
> > {
> > 	if (!vtime_accounting_enabled_this_cpu())
> > 		current->flags &= ~PF_VCPU;
> > }
> > 
> > It duplicates a bit of code but I think this will read less confusing.
> 
> Either way works for me.  I used vtime_account_guest_exit() to try to keep as
> many details as possible inside vtime, e.g. in case the implemenation is tweaked
> in the future.  But I agree that pretending KVM isn't already deeply intertwined
> with the details is a lie.

Ah I see, before 87fa7f3e98a131 the vtime was accounted after interrupts get
processed. So it used to work until then. I see that ARM64 waits for IRQs to
be enabled too.

PPC/book3s_hv, MIPS, s390 do it before IRQs get re-enabled (weird, how does that
work?)

And PPC/book3s_pr calls guest_exit() so I guess it has interrupts enabled.

The point is: does it matter to call vtime_account_guest_exit() before or
after interrupts? If it doesn't matter, we can simply call
vtime_account_guest_exit() once and for all once IRQs are re-enabled.

If it does matter because we don't want to account the host IRQs firing at the
end of vcpu exit, then probably we should standardize that behaviour and have
guest_exit_vtime() called before interrupts get enabled and guest_exit_tick()
called after interrupts get enabled. It's probably then beyond the scope of this
patchset but I would like to poke your opinion on that.

Thanks.
Sean Christopherson April 28, 2021, 10:38 p.m. UTC | #6
Apologies for the slow response.

On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> On Tue, Apr 20, 2021 at 11:26:34PM +0000, Sean Christopherson wrote:
> > On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> > > On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 16fb39503296..e4d475df1d4a 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >  	local_irq_disable();
> > > >  	kvm_after_interrupt(vcpu);
> > > >  
> > > > +	/*
> > > > +	 * When using tick-based accounting, wait until after servicing IRQs to
> > > > +	 * account guest time so that any ticks that occurred while running the
> > > > +	 * guest are properly accounted to the guest.
> > > > +	 */
> > > > +	if (!vtime_accounting_enabled_this_cpu())
> > > > +		vtime_account_guest_exit();
> > > 
> > > Can we rather have instead:
> > > 
> > > static inline void tick_account_guest_exit(void)
> > > {
> > > 	if (!vtime_accounting_enabled_this_cpu())
> > > 		current->flags &= ~PF_VCPU;
> > > }
> > > 
> > > It duplicates a bit of code but I think this will read less confusing.
> > 
> > Either way works for me.  I used vtime_account_guest_exit() to try to keep as
> > many details as possible inside vtime, e.g. in case the implemenation is tweaked
> > in the future.  But I agree that pretending KVM isn't already deeply intertwined
> > with the details is a lie.
> 
> Ah I see, before 87fa7f3e98a131 the vtime was accounted after interrupts get
> processed. So it used to work until then. I see that ARM64 waits for IRQs to
> be enabled too.
> 
> PPC/book3s_hv, MIPS, s390 do it before IRQs get re-enabled (weird, how does that
> work?)

No idea.  It's entirely possible it doesn't work on one or more of those
architectures.

Based on init/Kconfig, s390 doesn't support tick-based accounting, so I assume
s390 is ok.

  config TICK_CPU_ACCOUNTING
	bool "Simple tick based cputime accounting"
	depends on !S390 && !NO_HZ_FULL

> And PPC/book3s_pr calls guest_exit() so I guess it has interrupts enabled.
> 
> The point is: does it matter to call vtime_account_guest_exit() before or
> after interrupts? If it doesn't matter, we can simply call
> vtime_account_guest_exit() once and for all once IRQs are re-enabled.
> 
> If it does matter because we don't want to account the host IRQs firing at the
> end of vcpu exit, then probably we should standardize that behaviour and have
> guest_exit_vtime() called before interrupts get enabled and guest_exit_tick()
> called after interrupts get enabled. It's probably then beyond the scope of this
> patchset but I would like to poke your opinion on that.
> 
> Thanks.

I don't know.  For x86, I would be ok with simply moving the call to
vtime_account_guest_exit() to after IRQs are enabled.  It would bug me a little
bit that KVM _could_ be more precise when running with
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y, and KVM would still be poking into the details
of vtime_account_guest_exit() to some extent, but overall it would be an
improvement from a code cleanliness perspective.

The problem is I have no clue who, if anyone, deploys KVM on x86 with
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y.  On the other hand, AMD/SVM has always had the
"inaccurate" accounting, and Intel/VMX has been inaccurate since commit
d7a08882a0a4 ("KVM: x86: Unconditionally enable irqs in guest context"), which
amusingly was a fix for an edge case in tick-based accounting.

Anyone have an opinion either way?  I'm very tempted to go with Frederic's
suggestion of moving the time accounting back to where it was, it makes KVM just
a little less ugly.

Patch
diff mbox series

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 48b396f33bee..bb2aa0dde7c5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3750,17 +3750,24 @@  static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 	 * have them in state 'on' as recorded before entering guest mode.
 	 * Same as enter_from_user_mode().
 	 *
-	 * guest_exit_irqoff() restores host context and reinstates RCU if
-	 * enabled and required.
+	 * context_tracking_guest_exit_irqoff() restores host context and
+	 * reinstates RCU if enabled and required.
 	 *
 	 * This needs to be done before the below as native_read_msr()
 	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 	 * into world and some more.
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
+	context_tracking_guest_exit_irqoff();
 
 	instrumentation_begin();
+	/*
+	 * Account guest time when precise accounting based on context tracking
+	 * is enabled.  Tick based accounting is deferred until after IRQs that
+	 * occurred within the VM-Enter/VM-Exit "window" are handled.
+	 */
+	if (vtime_accounting_enabled_this_cpu())
+		vtime_account_guest_exit();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c05e6e2854b5..5ae9dc197048 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6639,17 +6639,24 @@  static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	 * have them in state 'on' as recorded before entering guest mode.
 	 * Same as enter_from_user_mode().
 	 *
-	 * guest_exit_irqoff() restores host context and reinstates RCU if
-	 * enabled and required.
+	 * context_tracking_guest_exit_irqoff() restores host context and
+	 * reinstates RCU if enabled and required.
 	 *
 	 * This needs to be done before the below as native_read_msr()
 	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 	 * into world and some more.
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
+	context_tracking_guest_exit_irqoff();
 
 	instrumentation_begin();
+	/*
+	 * Account guest time when precise accounting based on context tracking
+	 * is enabled.  Tick based accounting is deferred until after IRQs that
+	 * occurred within the VM-Enter/VM-Exit "window" are handled.
+	 */
+	if (vtime_accounting_enabled_this_cpu())
+		vtime_account_guest_exit();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16fb39503296..e4d475df1d4a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9230,6 +9230,14 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	local_irq_disable();
 	kvm_after_interrupt(vcpu);
 
+	/*
+	 * When using tick-based accounting, wait until after servicing IRQs to
+	 * account guest time so that any ticks that occurred while running the
+	 * guest are properly accounted to the guest.
+	 */
+	if (!vtime_accounting_enabled_this_cpu())
+		vtime_account_guest_exit();
+
 	if (lapic_in_kernel(vcpu)) {
 		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
 		if (delta != S64_MIN) {