All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Reiji Watanabe <reijiw@google.com>
Subject: Re: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
Date: Thu, 16 Sep 2021 09:19:12 +0200	[thread overview]
Message-ID: <87wnnhyolr.fsf@vitty.brq.redhat.com> (raw)
In-Reply-To: <YUIunxwjea/wq3gd@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>> > +{
>> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> > +
>> > +	init_vmcs(vmx);
>> > +
>> > +	if (nested)
>> > +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
>> > +
>> > +	vcpu_setup_sgx_lepubkeyhash(vcpu);
>> > +
>> > +	vmx->nested.posted_intr_nv = -1;
>> > +	vmx->nested.current_vmptr = -1ull;
>> > +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
>> 
>> What would happen in (hypothetical) case when enlightened VMCS is
>> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
>> nested_release_evmcs() (called from
>> nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
>> kvm_vcpu_unmap() while it should.
>
> The short answer is that there's a lot of stuff that needs to be addressed before
> KVM can expose a RESET ioctl().  My goal with these patches is to carve out the
> stubs and move the few bits of RESET emulation into the "stubs".  This is the same
> answer for the MSR question/comment at the end.
>
>> This, however, got me thinking: should we free all-things-nested with
>> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
>> find us doing that... (I do remember that INIT is blocked in VMX-root
>> mode and nobody else besides kvm_arch_vcpu_create()/
>> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
>> should at least add a WARN_ON() guardian here?
>
> I think that makes sense.  Maybe use CR0 as a sentinel since it has a non-zero
> RESET value?  E.g. WARN if CR0 is non-zero at RESET.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..3ac074376821 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         unsigned long new_cr0;
>         u32 eax, dummy;
>
> +       /*
> +        * <comment about KVM not supporting arbitrary RESET>
> +        */
> +       WARN_ON_ONCE(!init_event && old_cr0);
> +
>         kvm_lapic_reset(vcpu, init_event);
>
>         vcpu->arch.hflags = 0;
>

Should work (assuming nothing in VMX would want to call vmx_vcpu_reset()/
__vmx_vcpu_reset() directly).

> Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
> reset MMU context at vCPU RESET/INIT") technically introduced a bug.  kvm_vcpu_reset()
> does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
> (correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset().  init_vmcs()
> doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
> so VMREAD(GUEST_CR0) is technically consuming garbage.  In practice, it's consuming
> '0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
> the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.
>
> And staring more at kvm_vcpu_reset(), this code is terrifying for INIT
>
> 	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> 	vcpu->arch.regs_avail = ~0;
> 	vcpu->arch.regs_dirty = ~0;
>
> because it means cr0 and cr4 are marked available+dirty without immediately writing
> vcpu->arch.cr0/cr4.  And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
> via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
> Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
> vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
> kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.
>
> Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
> consuming stale data.  It doesn't matter in practice because the old value is
> only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
> refreshed by vmx_set_cr0() and friends.  PDPTRs and EXIT_INFO are in a similar
> boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
> marking them dirty without actually updating the cached values is wrong.
>
> There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
> That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
> INIT to set CR3=0.  
>
> I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
> and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
> x86 code writes a vcpu->arch register directly.  That would fix the CR0 read bug
> and also prevent subtle bugs from sneaking in.  Adding new EXREGS would be slightly
> more costly, but IMO that's a good thing as would force us to actually think about
> how to handle each register.
>
> E.g. implement this over 2-3 patches:
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 114847253e0a..743146ac8307 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_cr8(vcpu, 0);
>
>         vmx_segment_cache_clear(vmx);
> +       kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>
>         seg_setup(VCPU_SREG_CS);
>         vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..ab907a0b9eeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         int r;
>
>         vcpu->arch.last_vmentry_cpu = -1;
> +       vcpu->arch.regs_avail = ~0;
> +       vcpu->arch.regs_dirty = ~0;
>
>         if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
>                 vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> @@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>                 vcpu->arch.xcr0 = XFEATURE_MASK_FP;
>         }
>
> +       /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
>         memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> -       vcpu->arch.regs_avail = ~0;
> -       vcpu->arch.regs_dirty = ~0;
> +       kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
>
>         /*
>          * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
> @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>         kvm_rip_write(vcpu, 0xfff0);
>
> +       vcpu->arch.cr3 = 0;
> +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +
>         /*
>          * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
>          * of Intel's SDM list CD/NW as being set on INIT, but they contradict
>

A selftest for vCPU create/reset would be really helpful. I can even
volunteer to [eventually] write one :-)

-- 
Vitaly


  reply	other threads:[~2021-09-16  7:19 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14 23:08 [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Sean Christopherson
2021-09-14 23:08 ` [PATCH 1/3] KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation Sean Christopherson
2021-09-14 23:08 ` [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset() Sean Christopherson
2021-09-15 10:30   ` Vitaly Kuznetsov
2021-09-15 17:34     ` Sean Christopherson
2021-09-16  7:19       ` Vitaly Kuznetsov [this message]
2021-09-16 19:01         ` Sean Christopherson
2021-09-14 23:08 ` [PATCH 3/3] KVM: SVM: Move RESET emulation to svm_vcpu_reset() Sean Christopherson
2021-09-17 16:15 ` [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Paolo Bonzini
2021-09-17 17:34   ` Sean Christopherson
2021-09-17 17:37     ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wnnhyolr.fsf@vitty.brq.redhat.com \
    --to=vkuznets@redhat.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=reijiw@google.com \
    --cc=seanjc@google.com \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.