From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
To: Sean Christopherson <seanjc@google.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
Date: Wed, 6 Apr 2022 21:08:41 +0200
Message-ID: <5135b502-ce2e-babb-7812-4d4c431a5252@maciej.szmigiero.name>
In-Reply-To: <Yk3Jd6xAfgVoFgLc@google.com>

On 6.04.2022 19:10, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 03:48, Sean Christopherson wrote:
>>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>> (..)
>>>> Also, I'm not sure that even the proposed updated code above will
>>>> actually restore the L1-requested next_rip correctly on L1 -> L2
>>>> re-injection (will review once the full version is available).
>>>
>>> Spoiler alert, it doesn't.  Save yourself the review time.  :-)
>>>
>>> The missing piece is stashing away the injected event on nested VMRUN.  Those
>>> events don't get routed through the normal interrupt/exception injection code and
>>> so the next_rip info is lost on the subsequent #NPF.
>>>
>>> Treating soft interrupts/exceptions like they were injected by KVM (which they
>>> are, technically) works and doesn't seem too gross.  E.g. when prepping vmcb02
>>>
>>> 	if (svm->nrips_enabled)
>>> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
>>> 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
>>> 		vmcb02->control.next_rip    = vmcb12_rip;
>>>
>>> 	if (is_evtinj_soft(vmcb02->control.event_inj)) {
>>> 		svm->soft_int_injected = true;
>>> 		svm->soft_int_csbase = svm->vmcb->save.cs.base;
>>> 		svm->soft_int_old_rip = vmcb12_rip;
>>> 		if (svm->nrips_enabled)
>>> 			svm->soft_int_next_rip = svm->nested.ctl.next_rip;
>>> 		else
>>> 			svm->soft_int_next_rip = vmcb12_rip;
>>> 	}
>>>
>>> And then the VMRUN error path just needs to clear soft_int_injected.
>>
>> I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
>> injection structures (much like EXITINTINFO is parsed), as I said to
>> Maxim two days ago [1].
>> Not only for software {interrupts,exceptions} but for all incoming
>> events (again, just like EXITINTINFO).
> 
> Ahh, I saw that fly by, but somehow I managed to misread what you intended.
> 
> I like the idea of populating vcpu->arch.interrupt/exception as "injected" events.
> KVM prioritizes "injected" over other nested events, so in theory it should work
> without too much fuss.  I've run through a variety of edge cases in my head and
> haven't found anything that would be fundamentally broken.  I think even live
> migration would work.
> 
> I think I'd prefer to do that in a follow-up series so that nVMX can be converted
> at the same time?  It's a bit absurd to add the above soft int code knowing that,
> at least in theory, simply populating the right software structs would automagically
> fix the bug.  But manually handling the soft int case first would be safer in the
> sense that we'd still have a fix for the soft int case if it turns out that populating
> vcpu->arch.interrupt/exception isn't as straightforward as it seems.

I don't see any problem with having two patch series, the second series
depending on the first.

I planned to at least fix the bugs that I described in my previous message
(the NMI one actually breaks a real Windows guest), but that requires
knowing how the event injection code will finally look after the
first series of fixes.

>> However, there is another issue related to L1 -> L2 event re-injection
>> using standard KVM event injection mechanism: it mixes the L1 injection
>> state with the L2 one.
>>
>> Specifically for SVM:
>> * When re-injecting an NMI into L2, NMI blocking is enabled in
>> vcpu->arch.hflags (shared between L1 and L2) and the IRET intercept is
>> enabled.
>>
>> This is incorrect, since it is L1 that is responsible for enforcing NMI
>> blocking for NMIs that it injects into its L2.
> 
> Ah, I see what you're saying.  I think :-)  IIUC, we can fix this bug without any
> new flags, just skip the side effects if the NMI is being injected into L2.
> 
> @@ -3420,6 +3424,10 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>          struct vcpu_svm *svm = to_svm(vcpu);
> 
>          svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
> +
> +       if (is_guest_mode(vcpu))
> +               return;
> +
>          vcpu->arch.hflags |= HF_NMI_MASK;
>          if (!sev_es_guest(vcpu->kvm))
>                  svm_set_intercept(svm, INTERCEPT_IRET);
> 
> and for nVMX:
> 
> @@ -4598,6 +4598,9 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
>   {
>          struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
> +       if (is_guest_mode(vcpu))
> +               goto inject_nmi;
> +
>          if (!enable_vnmi) {
>                  /*
>                   * Tracking the NMI-blocked state in software is built upon
> @@ -4619,6 +4622,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
>                  return;
>          }
> 
> +inject_nmi:
>          vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
>                          INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
> 

And what if it's L0 that is trying to inject an NMI into L2?
In this case is_guest_mode() is true, but the full NMI injection machinery
should be used.
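
Something along these lines would be needed instead - just a rough sketch,
with a made-up "nmi_l1_to_l2" flag (set while parsing VMCB12 EVENTINJ)
standing in for however we end up marking L1 -> L2 events:

static void svm_inject_nmi(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

	/*
	 * Skip the L0-side effects only when this NMI was injected by L1
	 * into L2 (parsed from VMCB12 EVENTINJ); an NMI that L0 itself
	 * injects into L2 still needs the full machinery.
	 */
	if (svm->nmi_l1_to_l2)
		return;

	vcpu->arch.hflags |= HF_NMI_MASK;
	if (!sev_es_guest(vcpu->kvm))
		svm_set_intercept(svm, INTERCEPT_IRET);
}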

>> Also, *L2* being the target of such injection definitely should not block
>> further NMIs for *L1*.
> 
> Actually, it should block NMIs for L1.  From L1's perspective, the injection is
> part of VM-Entry.  That's a single gigantic instruction, thus there is no NMI window
> until VM-Entry completes from L1's perspective.  Any exit that occurs on vectoring
> an injected event and is handled by L0 should not be visible to L1, because from
> L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME.  So blocking new events
> because an NMI (or any event) needs to be reinjected for L2 is correct.

I think this kind of NMI blocking will already be handled by keeping
the pending new NMI in vcpu->arch.nmi_pending and the one that needs
re-injecting in vcpu->arch.nmi_injected.

The pending new NMI in vcpu->arch.nmi_pending won't be handled until
vcpu->arch.nmi_injected gets cleared (that is, until re-injection is
successful).
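
That is, roughly this ordering in the generic injection path (paraphrased
from memory of inject_pending_event() in arch/x86/kvm/x86.c, not the exact
code):

	/* Re-inject an event that was already delivered to the guest but
	 * got cut short (e.g. by an #NPF on vectoring) before considering
	 * anything new... */
	if (vcpu->arch.nmi_injected)
		static_call(kvm_x86_inject_nmi)(vcpu);

	/* ...and only queue a *new* NMI once nothing is awaiting
	 * re-injection and an NMI window is open. */
	if (vcpu->arch.nmi_pending && !vcpu->arch.nmi_injected &&
	    static_call(kvm_x86_nmi_allowed)(vcpu, true) > 0) {
		--vcpu->arch.nmi_pending;
		vcpu->arch.nmi_injected = true;
		static_call(kvm_x86_inject_nmi)(vcpu);
	}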

It is incorrect, however, to wait for L2 to execute IRET to unblock
L0 -> L1 NMIs or L1 -> L2 NMIs; in these two cases we (L0) just need the CPU
to vector that L2 NMI so it no longer shows up in EXITINTINFO.

It is also incorrect to block L1 -> L2 NMI injection because either L1
or L2 is currently under NMI blocking: the first case is obvious,
and in the second it's L1 that is supposed to take care of proper NMI
blocking for L2 when injecting an NMI there.

>> * When re-injecting a *hardware* IRQ into L2, GIF is checked (previously
>> even at the BUG_ON() level), while L1 should be able to inject even when
>> L2 GIF is off,
> 
> Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> active?  Hmm, or deleting the assertion altogether, it's likely doing more harm
> than good at this point.

I assume this assertion is meant to catch the case when KVM itself (L0) is
trying to erroneously inject a hardware interrupt into L1 or L2, so it will
need to be skipped only for L1 -> L2 event injection.

Whether this assertion's benefits outweigh its costs is debatable - I don't
have a strong opinion here (BUG_ON() is for sure too strong, but WARN_ON_ONCE()
might make sense to catch latent bugs).
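
A rough sketch of what svm_inject_irq() could then look like (the
"irq_l1_to_l2" flag is again hypothetical, set when the event is parsed
out of VMCB12 EVENTINJ):

static void svm_inject_irq(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * Complain if L0 itself tries to inject a hardware IRQ while GIF
	 * is clear, but not about L1 -> L2 injection, since L1 may inject
	 * regardless of L2's GIF.
	 */
	WARN_ON_ONCE(!gif_set(svm) && !svm->irq_l1_to_l2);

	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
	++vcpu->stat.irq_injections;

	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
				       SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
}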

>> With the code in my previous patch set I planned to use
>> exit_during_event_injection() to detect such a case, but if we implement
>> VMCB12 EVENTINJ parsing we can simply add a flag indicating that the relevant
>> event comes from L1, so its normal injection side effects should be skipped.
> 
> Do we still need a flag based on the above?  Honest question... I've been staring
> at all this for the better part of an hour and may have lost track of things.

If checking just is_guest_mode() is not enough, due to the reasons I described
above, then we need to somehow determine in the NMI / IRQ injection handler
whether the event to be injected into L2 comes from L0 or L1.
For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
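
Roughly something like this on the nested VMRUN path - all the names below
are made up for illustration, and the real thing would also have to handle
exceptions, error codes and soft events:

static void nested_svm_parse_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
{
	struct vcpu_svm *svm = to_svm(vcpu);
	u8 vector = event_inj & SVM_EVTINJ_VEC_MASK;

	if (!(event_inj & SVM_EVTINJ_VALID))
		return;

	switch (event_inj & SVM_EVTINJ_TYPE_MASK) {
	case SVM_EVTINJ_TYPE_NMI:
		vcpu->arch.nmi_injected = true;
		svm->nmi_l1_to_l2 = true;	/* hypothetical flag */
		break;
	case SVM_EVTINJ_TYPE_INTR:
		kvm_queue_interrupt(vcpu, vector, false);
		svm->irq_l1_to_l2 = true;	/* hypothetical flag */
		break;
	/* SVM_EVTINJ_TYPE_EXEPT / SVM_EVTINJ_TYPE_SOFT handling elided */
	}
}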

>> By the way, the relevant VMX code also looks rather suspicious,
>> especially for the !enable_vnmi case.
> 
> I think it's safe to say we can ignore edge cases for !enable_vnmi.  It might even
> be worth trying to remove that support again (Paolo tried years ago), IIRC the
> only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.

Ack, we could at least disable nested on !enable_vnmi.
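
Something as trivial as this in vmx.c's hardware_setup() would probably do
(just a sketch):

	if (!enable_vnmi && nested) {
		pr_warn("kvm: nested virtualization requires virtual NMIs, disabling it\n");
		nested = false;
	}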

BTW, I think that besides Yonah cores, very early Core 2 CPUs also lacked
vNMI support; that's why !enable_vnmi support was reinstated.
But that's hardware even older than the !nrips AMD parts.

Thanks,
Maciej
