Re: [PATCH] KVM: x86: Fix the initial value of mcg_cap

From: Xiaoyao Li <xiaoyao.li@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86: Fix the initial value of mcg_cap
Date: Fri, 21 Oct 2022 11:12:57 +0800	[thread overview]
Message-ID: <b40fd338-cb3b-b602-0059-39f775e77ad6@intel.com> (raw)
In-Reply-To: <Y1F4AoeOhNFQnHnJ@google.com>

On 10/21/2022 12:32 AM, Sean Christopherson wrote:
> On Thu, Oct 20, 2022, Xiaoyao Li wrote:
>> On 10/20/2022 10:27 PM, Sean Christopherson wrote:
>>> On Thu, Oct 20, 2022, Xiaoyao Li wrote:
>>>> vcpu->arch.mcg_cap represents the value of MSR_IA32_MCG_CAP. It's
>>>> set via ioctl(KVM_X86_SETUP_MCE) from userspace when exposing and
>>>> configuring MCE to guest.
>>>>
>>>> It's wrong to leave the default value as KVM_MAX_MCE_BANKS.
>>>
>>> Why?  I agree it's an odd default, but the whole MCE API is odd.  Functionally,
>>> I don't see anything that's broken by allowing the guest to access the MCx_CTL MSRs
>>> by default.
>>
>> Yes. Allowing the access doesn't cause any issue for a VM.
>>
>> However, for the perspective of virtualization. It virtualizes a magic
>> hardware that even CPUID.MCA/MCE is not advertised and MCE is not set up by
>> userspace, guest is told there are 32 banks and all the banks can be
>> accessed.
> 
> '0' isn't necessarily better though, e.g. if userspace parrots back KVM's "supported"
> CPUID without invoking KVM_X86_SETUP_MCE, then it's equally odd that the guest will
> see no supported MCE MSRS.
> 
> Older versions of the SDM also state (or at least very strongly imply) that banks
> 0-3 are always available on P6.
> 
> Bank 0 is an especially weird case, as several of the MSRs are aliased to other
> MSRs that predate the machine check architecture.
> 
> Anyways, if this were newly introduced code I'd be all for defaulting to '0', but
> KVM has defaulted to KVM_MAX_MCE_BANKS since KVM_X86_SETUP_MCE was added way back
> in 2009.  Unless there's a bug that's fixed by this, I'm inclined to keep the
> current behavior even though it's weird, as hiding all MCE MSRs by default could
> theoretically cause a regression, e.g. by triggering #GP on MSRs that an older
> guest expects to always exist.

fair enough.

> If we really want to clean up this code, I think the correct approach would be to
> inject #GP on all relevant MSRs if CPUID.MCA==0, e.g.

It's what I thought of as well. But I didn't find any statement in SDM 
of "Accessing Machine Check MSRs gets #GP if no CPUID.MCA"

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bd5f8a751de..97fafd851d8d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3260,6 +3260,9 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>          u64 data = msr_info->data;
>          u32 offset, last_msr;
>   
> +       if (!msr_info->host_initiated && !guest_cpuid_has(X86_FEATURE_MCA))
> +               return 1;
> +
>          switch (msr) {
>          case MSR_IA32_MCG_STATUS:
>                  vcpu->arch.mcg_status = data;
> @@ -3891,6 +3894,14 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
>          unsigned bank_num = mcg_cap & 0xff;
>          u32 offset, last_msr;
>   
> +       if (msr == MSR_IA32_P5_MC_ADDR || msr == MSR_IA32_P5_MC_TYPE) {
> +               *pdata = 0;
> +               return 0;
> +       }
> +
> +       if (!host && !guest_cpuid_has(X86_FEATURE_MCA))
> +               return 1;
> +
>          switch (msr) {
>          case MSR_IA32_P5_MC_ADDR:
>          case MSR_IA32_P5_MC_TYPE:
> 
> Or alternatively, this should work too:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bd5f8a751de..e4a44d7af0a6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3774,6 +3774,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>          case MSR_IA32_MCG_STATUS:
>          case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
>          case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> +               if (!msr_info->host_initiated &&
> +                   !guest_cpuid_has(X86_FEATURE_MCA))
> +                       return 1;
>                  return set_msr_mce(vcpu, msr_info);
>   
>          case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
> @@ -4142,13 +4145,17 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   
>                  msr_info->data = vcpu->arch.msr_kvm_poll_control;
>                  break;
> -       case MSR_IA32_P5_MC_ADDR:
> -       case MSR_IA32_P5_MC_TYPE:
>          case MSR_IA32_MCG_CAP:
>          case MSR_IA32_MCG_CTL:
>          case MSR_IA32_MCG_STATUS:
>          case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
>          case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> +               if (!msr_info->host_initiated &&
> +                   !guest_cpuid_has(X86_FEATURE_MCA))
> +                       return 1;
> +               fallthrough;
> +       case MSR_IA32_P5_MC_ADDR:
> +       case MSR_IA32_P5_MC_TYPE:
>                  return get_msr_mce(vcpu, msr_info->index, &msr_info->data,
>                                     msr_info->host_initiated);
>          case MSR_IA32_XSS:
>