From: Sean Christopherson <seanjc@google.com>
To: Aaron Lewis <aaronlewis@google.com>
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com
Subject: Re: [PATCH] KVM: x86/pmu: SRCU protect the PMU event filter in the fast path
Date: Mon, 26 Jun 2023 10:34:14 -0700
Message-ID: <ZJnMFq+BQF46NGut@google.com>
In-Reply-To: <CAAAPnDEb0dwdWsF6K9s1r=gZSQHXwo5Y8U9FWGzX52_KSFk_hw@mail.gmail.com>

On Mon, Jun 26, 2023, Aaron Lewis wrote:
> As a separate issue, shouldn't we restrict the MSR filter from being
> able to intercept MSRs handled by the fast path?  I see that we do
> that for the APIC MSRs, but if MSR_IA32_TSC_DEADLINE is handled by the
> fast path, I don't see a way for userspace to override that behavior.
> So maybe it shouldn't?  E.g.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 439312e04384..dd0a314da0a3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1787,7 +1787,7 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
>         u32 i;
> 
>         /* x2APIC MSRs do not support filtering. */
> -       if (index >= 0x800 && index <= 0x8ff)
> +       if ((index >= 0x800 && index <= 0x8ff) || index == MSR_IA32_TSC_DEADLINE)
>                 return true;
> 
>         idx = srcu_read_lock(&kvm->srcu);
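[Editor's note: standalone, the bypass check in the quoted diff boils down to a range test plus one extra compare. A minimal plain-C sketch outside KVM, with the constants copied from msr-index.h and the x2APIC MSR range; the helper name is hypothetical:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the kernel's definitions. */
#define MSR_IA32_TSC_DEADLINE	0x6e0
#define X2APIC_MSR_BASE		0x800
#define X2APIC_MSR_END		0x8ff

/*
 * Sketch of the proposed bypass: x2APIC MSRs and TSC_DEADLINE would
 * skip the userspace MSR filter entirely.  Parenthesized explicitly;
 * && binds tighter than ||, so the diff's version is equivalent, but
 * the parens make the grouping obvious.
 */
static bool msr_bypasses_filtering(uint32_t index)
{
	return (index >= X2APIC_MSR_BASE && index <= X2APIC_MSR_END) ||
	       index == MSR_IA32_TSC_DEADLINE;
}
```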

Yeah, I saw that flaw too :-/  I'm not entirely sure what to do about MSRs that
can be handled in the fastpath.

On one hand, intercepting those MSRs probably doesn't make much sense.  On the
other hand, the MSR filter needs to be uABI, i.e. we can't make the statement
"MSRs handled in KVM's fastpath can't be filtered", because either every new
fastpath MSR will potentially break userspace, or KVM will be severely limited
with respect to what can be handled in the fastpath.

From an ABI perspective, the easiest thing is to fix the bug and enforce any
filter that affects MSR_IA32_TSC_DEADLINE.  If we ignore performance, the fix is
trivial.  E.g.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f220c04624e..3ef903bb78ce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2174,6 +2174,9 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu)
 
        kvm_vcpu_srcu_read_lock(vcpu);
 
+       if (!kvm_msr_allowed(vcpu, msr, KVM_MSR_FILTER_WRITE))
+               goto out;
+
        switch (msr) {
        case APIC_BASE_MSR + (APIC_ICR >> 4):
                data = kvm_read_edx_eax(vcpu);
@@ -2196,6 +2199,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu)
        if (ret != EXIT_FASTPATH_NONE)
                trace_kvm_msr_write(msr, data);
 
+out:
        kvm_vcpu_srcu_read_unlock(vcpu);
 
        return ret;

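[Editor's note: the kvm_msr_allowed() call added above is what makes this a performance question. Roughly, it walks the userspace-installed ranges and tests a per-MSR bit; a simplified standalone model of that walk (hypothetical types, minus the SRCU locking and read/write type matching):]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a userspace-installed MSR filter. */
struct msr_range {
	uint32_t base;
	uint32_t nmsrs;
	const uint8_t *bitmap;	/* one bit per MSR; set == allowed here */
};

struct msr_filter {
	bool default_allow;
	size_t count;
	const struct msr_range *ranges;
};

/*
 * Linear walk over the ranges, roughly what kvm_msr_allowed() does.
 * This per-WRMSR walk is the cost the fastpath would be paying for
 * an MSR that is almost always allowed.
 */
static bool msr_allowed(const struct msr_filter *f, uint32_t index)
{
	for (size_t i = 0; i < f->count; i++) {
		const struct msr_range *r = &f->ranges[i];
		uint32_t off = index - r->base;	/* wraps if index < base */

		if (off < r->nmsrs)
			return r->bitmap[off / 8] & (1u << (off % 8));
	}
	return f->default_allow;
}
```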
But I don't love the idea of searching through the filters for an MSR that is
pretty much guaranteed to be allowed.  Since x2APIC MSRs can't be filtered, we
could add a per-vCPU flag to track if writes to TSC_DEADLINE are allowed, i.e.
if TSC_DEADLINE can be handled in the fastpath.
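[Editor's note: a minimal sketch of that cached-flag idea, assuming the flag is recomputed by the slow path whenever userspace installs a new filter; all names are hypothetical:]

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_TSC_DEADLINE	0x6e0

/* Hypothetical per-vCPU cache of fastpath-relevant filter results. */
struct fastpath_cache {
	bool tsc_deadline_wrmsr_allowed;
};

/*
 * Called from the slow path whenever the MSR filter changes; the
 * caller passes the result of the full filter lookup for TSC_DEADLINE.
 */
static void fastpath_cache_update(struct fastpath_cache *c,
				  bool tsc_deadline_allowed)
{
	c->tsc_deadline_wrmsr_allowed = tsc_deadline_allowed;
}

/* In the fastpath: a single flag test instead of a filter walk. */
static bool fastpath_wrmsr_ok(const struct fastpath_cache *c, uint32_t msr)
{
	switch (msr) {
	case MSR_IA32_TSC_DEADLINE:
		return c->tsc_deadline_wrmsr_allowed;
	default:
		return true;	/* x2APIC MSRs can't be filtered */
	}
}
```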

However, at some point Intel and/or AMD will (hopefully) add support for full
virtualization of TSC_DEADLINE, and then TSC_DEADLINE will be in the same boat as
the x2APIC MSRs, i.e. allowing userspace to filter TSC_DEADLINE when it's fully
virtualized would be nonsensical.  And depending on how hardware behaves, i.e. how
a virtual TSC_DEADLINE interacts with the MSR bitmaps, *enforcing* userspace's
filtering might require a small amount of additional complexity.

And any MSR that is performance sensitive enough to be handled in the fastpath is
probably worth virtualizing in hardware, i.e. we'll end up revisiting this topic
every time we add an MSR to the fastpath :-(

I'm struggling to come up with an idea that won't create an ABI nightmare, won't
be subject to the whims of AMD and Intel, and won't saddle KVM with complexity to
support behavior that in all likelihood no one wants.

I'm leaning toward enforcing the filter for TSC_DEADLINE, and crossing my fingers
that neither AMD nor Intel implements TSC_DEADLINE virtualization in such a way
that it changes the behavior of WRMSR interception.


Thread overview: 4+ messages
2023-06-23 12:35 [PATCH] KVM: x86/pmu: SRCU protect the PMU event filter in the fast path Aaron Lewis
2023-06-23 15:43 ` Sean Christopherson
2023-06-26 16:37   ` Aaron Lewis
2023-06-26 17:34     ` Sean Christopherson [this message]
