Re: The vcpu won't be wakened for a long time

From: Sean Christopherson <seanjc@google.com>
To: "Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.)" 
	<longpeng2@huawei.com>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	Huangzhichao <huangzhichao@huawei.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: The vcpu won't be wakened for a long time
Date: Tue, 21 Dec 2021 15:27:01 +0000
Message-ID: <YcHyReHoF+qjIVTy@google.com>
In-Reply-To: <8a1a3ac75a6e4acf9bd1ce9779835e1c@huawei.com>

On Sat, Dec 18, 2021, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > Hmm, that strongly suggests the "vcpu != kvm_get_running_vcpu()" is at fault.
> > Can you try running with the below commit?  It's currently sitting in kvm/queue,
> > but not marked for stable because I didn't think it was possible for the check
> > to cause a missed wake event in KVM's current code base.
> > 
> 
> The below commit fixes the bug; we have just completed the tests.
> Thanks.

Aha!  Somehow I missed this call chain when analyzing the change.

  irqfd_wakeup()
  |
  |-> kvm_arch_set_irq_inatomic()
      |
      |-> kvm_irq_delivery_to_apic_fast()
          |
          |-> kvm_apic_set_irq()

Paolo, can the changelog be amended to the below, and maybe even pull the commit
into 5.16?

KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU

Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
If a vIRQ is delivered from asynchronous context, the target vCPU can be
the currently running vCPU and can also be blocking, in which case
skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
be a wake event for the vCPU.

The "do nothing" logic when "vcpu == running_vcpu" mostly works only
because the majority of calls to ->deliver_posted_interrupt(), especially
when using posted interrupts, come from synchronous KVM context.  But if
a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
use posted interrupts, wake events from the device will be delivered to
KVM from IRQ context, e.g.

  vfio_msihandler()
  |
  |-> eventfd_signal()
      |
      |-> ...
          |
          |-> irqfd_wakeup()
              |
              |-> kvm_arch_set_irq_inatomic()
                  |
                  |-> kvm_irq_delivery_to_apic_fast()
                      |
                      |-> kvm_apic_set_irq()

This also aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.

Fixes: 379a3c8ee444 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

> > commit 6a8110fea2c1b19711ac1ef718680dfd940363c6
> > Author: Sean Christopherson <seanjc@google.com>
> > Date:   Wed Dec 8 01:52:27 2021 +0000
> > 
> >     KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
> > 
> >     Drop a check that guards triggering a posted interrupt on the currently
> >     running vCPU, and more importantly guards waking the target vCPU if
> >     triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
> >     The "do nothing" logic when "vcpu == running_vcpu" works only because KVM
> >     doesn't have a path to ->deliver_posted_interrupt() from asynchronous
> >     context, e.g. if apic_timer_expired() were changed to always go down the
> >     posted interrupt path for APICv, or if the IN_GUEST_MODE check in
> >     kvm_use_posted_timer_interrupt() were dropped, and the hrtimer fired in
> >     kvm_vcpu_block() after the final kvm_vcpu_check_block() check, the vCPU
> >     would be scheduled() out without being awakened, i.e. would "miss" the
> >     timer interrupt.
> > 
> >     One could argue that invoking kvm_apic_local_deliver() from (soft) IRQ
> >     context for the current running vCPU should be illegal, but nothing in
> >     KVM actually enforces that rule.  There's also no strong obvious benefit
> >     to making such behavior illegal, e.g. checking IN_GUEST_MODE and calling
> >     kvm_vcpu_wake_up() is at worst marginally more costly than querying the
> >     current running vCPU.
> > 
> >     Lastly, this aligns the non-nested and nested usage of triggering posted
> >     interrupts, and will allow for additional cleanups.
> > 
> >     Signed-off-by: Sean Christopherson <seanjc@google.com>
> >     Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> >     Message-Id: <20211208015236.1616697-18-seanjc@google.com>
> >     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 38749063da0e..f61a6348cffd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3995,8 +3995,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> >          * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
> >          * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
> >          */
> > -       if (vcpu != kvm_get_running_vcpu() &&
> > -           !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
> > +       if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
> >                 kvm_vcpu_wake_up(vcpu);
> > 
> >         return 0;