Re: [PATCH 3/3] KVM: LAPIC: Optimize PMI delivering overhead

From: Wanpeng Li <kernellwp@gmail.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>, kvm <kvm@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>
Subject: Re: [PATCH 3/3] KVM: LAPIC: Optimize PMI delivering overhead
Date: Sat, 9 Oct 2021 17:14:05 +0800
Message-ID: <CANRm+Czj4Kv56HcX2vYu6mMa6o6xrMrCKmZ8x=rp-apLrrGHZQ@mail.gmail.com>
In-Reply-To: <YWBq56G/ZrsytEP7@google.com>

On Fri, 8 Oct 2021 at 23:59, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Oct 08, 2021, Wanpeng Li wrote:
> > On Fri, 8 Oct 2021 at 18:52, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> > >
> > > Wanpeng Li <kernellwp@gmail.com> writes:
> > >
> > > > From: Wanpeng Li <wanpengli@tencent.com>
> > > >
> > > > The overhead of kvm_vcpu_kick() is huge because of the expensive RCU/memory
> > > > barrier etc. operations in rcuwait_wake_up(). It is worse for local
>
> Memory barriers on x86 are just compiler barriers.  The only meaningful overhead
> is the locked transaction in rcu_read_lock() => preempt_disable().  I suspect the
> performance benefit from this patch comes either from avoiding a second
> lock when disabling preemption again for get_cpu(), or by avoiding the cmpxchg()
> in kvm_vcpu_exiting_guest_mode().
>
> > > > delivery since the vCPU is scheduled and we still suffer from this.
> > > > We can observe 12us+ for kvm_vcpu_kick() in kvm_pmu_deliver_pmi()
> > > > path by ftrace before the patch and 6us+ after the optimization.
>
> Those numbers seem off, I wouldn't expect a few locks to take 6us.

Maybe ftrace itself introduces the extra overhead.

>
> > > > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > > > ---
> > > >  arch/x86/kvm/lapic.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 76fb00921203..ec6997187c6d 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -1120,7 +1120,8 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
> > > >       case APIC_DM_NMI:
> > > >               result = 1;
> > > >               kvm_inject_nmi(vcpu);
> > > > -             kvm_vcpu_kick(vcpu);
> > > > +             if (vcpu != kvm_get_running_vcpu())
> > > > +                     kvm_vcpu_kick(vcpu);
> > >
> > > Out of curiosity,
> > >
> > > can this be converted into a generic optimization for kvm_vcpu_kick()
> > > instead? I.e. if kvm_vcpu_kick() is called for the currently running
> > > vCPU, there's almost nothing to do, especially when we already have a
> > > request pending, right? (I didn't put too much thought into it)
> >
> > I thought about it before; I will do it in the next version since you
> > also vote for it. :)
>
> Adding a kvm_get_running_vcpu() check before kvm_vcpu_wake_up() in kvm_vcpu_kick()
> is not functionally correct as it's possible to reach kvm_vcpu_kick() from (soft)
> IRQ context, e.g. hrtimer => apic_timer_expired() and pi_wakeup_handler().  If
> the kick occurs after prepare_to_rcuwait() and the final kvm_vcpu_check_block(),
> but before the vCPU is scheduled out, then the kvm_vcpu_wake_up() is required to
> wake the vCPU, even if it is the current running vCPU.

Good point.
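
To make the race above concrete, here is a rough user-space model of it.
This is only a sketch and not kernel code: every model_* name and the
two-state task model are made up for illustration, and the comments only
say which kernel step each line is meant to stand in for.

```c
/* User-space model of the lost-wakeup race; not kernel code. */
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-state stand-in for the vCPU task's scheduler state. */
enum model_task_state { MODEL_TASK_RUNNING, MODEL_TASK_SLEEPING };

struct model_vcpu {
	enum model_task_state state;	/* set to sleeping before the final check */
	bool event_pending;		/* event queued by the (soft) IRQ handler */
};

/*
 * Models a kick issued from (soft) IRQ context on the CPU where the vCPU
 * itself is running, so "vcpu == running vCPU" always holds in this model.
 */
static void model_kick(struct model_vcpu *v, bool skip_wake_if_running)
{
	if (skip_wake_if_running)
		return;			/* the generic shortcut under discussion */
	v->state = MODEL_TASK_RUNNING;	/* stands in for kvm_vcpu_wake_up() */
}

/* Returns true if the vCPU ends up asleep with an event still pending. */
bool model_block_is_stuck(bool skip_wake_if_running)
{
	struct model_vcpu v = { MODEL_TASK_RUNNING, false };
	bool should_block;

	v.state = MODEL_TASK_SLEEPING;		/* after prepare_to_rcuwait() */
	should_block = !v.event_pending;	/* final kvm_vcpu_check_block() */
	assert(should_block);

	/* The IRQ fires here, before the vCPU is scheduled out. */
	v.event_pending = true;
	model_kick(&v, skip_wake_if_running);

	/* Scheduling out: the task really sleeps only if still marked sleeping. */
	return v.state == MODEL_TASK_SLEEPING && v.event_pending;
}
```

In this model, model_block_is_stuck(true) reports a stuck vCPU while
model_block_is_stuck(false) does not, which is why the wake-up cannot be
skipped just because the target is the running vCPU.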

>
> The extra check might also degrade performance for many cases since the full kick
> path would need to disable preemption three times, though if the overhead is from
> x86's cmpxchg() then it's a moot point.
>
> I think we'd want something like this to avoid extra preempt_disable() as well
> as the cmpxchg() when @vcpu is the running vCPU.

Will do it in v2, thanks for the suggestion.
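
For reference, the shape I have in mind is roughly the following. It is
a user-space model under my own assumptions, not the code trimmed from
the quote above and not the final patch: the model_* names, the mode
enum, and the ipis_sent counter are all hypothetical. The wake-up is
always attempted first, and only the mode update and the IPI are skipped
for the running vCPU.

```c
/* User-space model of a kick with a same-CPU fast path; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_mode { MODEL_OUTSIDE_GUEST, MODEL_IN_GUEST, MODEL_EXITING_GUEST };

struct model_vcpu {
	enum model_mode mode;
	bool blocked;		/* vCPU task is waiting to be woken */
	int ipis_sent;		/* counts modeled reschedule IPIs */
};

/* Stands in for the per-CPU "running vCPU" pointer. */
static struct model_vcpu *model_running_vcpu = NULL;

/* Stands in for kvm_vcpu_wake_up(): returns true if a sleeper was woken. */
static bool model_wake_up(struct model_vcpu *v)
{
	if (!v->blocked)
		return false;
	v->blocked = false;
	return true;
}

void model_kick(struct model_vcpu *v)
{
	assert(v != NULL);

	/* Always try the wake-up first, even for the running vCPU. */
	if (model_wake_up(v))
		return;

	if (v == model_running_vcpu) {
		/* Same CPU: a plain store suffices, no atomic op and no IPI. */
		if (v->mode == MODEL_IN_GUEST)
			v->mode = MODEL_EXITING_GUEST;
		return;
	}

	/* Remote vCPU: models the cmpxchg() on the mode followed by an IPI. */
	if (v->mode == MODEL_IN_GUEST) {
		v->mode = MODEL_EXITING_GUEST;
		v->ipis_sent++;
	}
}
```

Keeping the wake-up ahead of the running-vCPU check is what preserves
the (soft) IRQ case: a vCPU that is about to block is still woken, and
only the remote-kick work is avoided.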

    Wanpeng