Re: [PATCH 0/2] KVM: x86: Fix and cleanup for recent AVIC changes

From: Sean Christopherson <seanjc@google.com>
To: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] KVM: x86: Fix and cleanup for recent AVIC changes
Date: Mon, 11 Oct 2021 16:58:06 +0000	[thread overview]
Message-ID: <YWRtHmAUaKcbWEzH@google.com> (raw)
In-Reply-To: <YWRJwZF1toUuyBdC@google.com>

On Mon, Oct 11, 2021, Sean Christopherson wrote:
> On Sun, Oct 10, 2021, Maxim Levitsky wrote:
> > On Fri, 2021-10-08 at 18:01 -0700, Sean Christopherson wrote:
> > > Belated "code review" for Maxim's recent series to rework the AVIC inhibit
> > > code.  Using the global APICv status in the page fault path is wrong as
> > > the correct status is always the vCPU's, since that status is accurate
> > > with respect to the time of the page fault.  In a similar vein, the code
> > > to change the inhibit can be cleaned up since KVM can't rely on ordering
> > > between the update and the request for anything except consumers of the
> > > request.
> > > 
> > > Sean Christopherson (2):
> > >   KVM: x86/mmu: Use vCPU's APICv status when handling APIC_ACCESS
> > >     memslot
> > >   KVM: x86: Simplify APICv update request logic
> > > 
> > >  arch/x86/kvm/mmu/mmu.c |  2 +-
> > >  arch/x86/kvm/x86.c     | 16 +++++++---------
> > >  2 files changed, 8 insertions(+), 10 deletions(-)
> > > 
> > 
> > Are you sure about it? Let me explain how the algorithm works:
> > 
> > - kvm_request_apicv_update:
> > 
> > 	- take kvm->arch.apicv_update_lock
> > 
> > 	- if inhibition state doesn't really change (kvm->arch.apicv_inhibit_reasons still zero or non zero)
> > 		- update kvm->arch.apicv_inhibit_reasons
> > 		- release the lock
> > 
> > 	- raise KVM_REQ_APICV_UPDATE
> > 		* since kvm->arch.apicv_update_lock is taken, all vCPUs will be
> > 		kicked out of guest mode and will be either doing someing in
> > 		the KVM (like page fault) or stuck on trying to process that
> > 		request the important thing is that no vCPU will be able to get
> > 		back to the guest mode.
> > 
> > 	- update the kvm->arch.apicv_inhibit_reasons
> > 		* since we hold vm->arch.apicv_update_lock vcpus can't see the new value
> 
> This assertion is incorrect, kvm_apicv_activated() is not guarded by the lock.
> 
> > 	- update the SPTE that covers the APIC's mmio window:
> 
> This won't affect in-flight page faults.
> 
> 
>    vCPU0                               vCPU1
>    =====                               =====
>    Disabled APICv
>    #NPT                                Acquire apicv_update_lock
>                                        Re-enable APICv
>    kvm_apicv_activated() == false

Doh, that's supposed to be "true".

>    incorrectly handle as regular MMIO
>                                        zap APIC pages
>    MMIO cache has bad entry

Argh, I forgot the memslot is still there, so the access won't be treated as MMIO
and thus won't end up in the MMIO cache.

So I agree that the code is functionally ok, but I'd still prefer to switch to
kvm_vcpu_apicv_active() so that this code is coherent with respect to the APICv
status at the time the fault occurred.

My objection to using kvm_apicv_activated() is that the result is completely
non-deterministic with respect to the vCPU's APICv status at the time of the
fault.  It works because all of the other mechanisms that are in place, e.g.
elevating the MMU notifier count, but the fact that the result is non-deterministic
means that using the per-vCPU status is also functionally ok.

At a minimum, I'd like to add a blurb in the kvm_faultin_pfn() comment to call out
the reliance on mmu_notifier_seq.

E.g. if kvm_zap_gfn_range() wins the race to acquire mmu_lock() after APICv is
inhibited/disabled by __kvm_request_apicv_update(), then direct_page_fault() will
retry the fault due to the change in mmu_notifier_seq.  If direct_page_fault()
wins the race, then kvm_zap_gfn_range() will zap the freshly-installed SPTEs.
For the uninhibit/enable case, at worst KVM will emulate an access that could have
been accelerated by retrying the instruction.