Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending

From: Liran Alon <LIRAN.ALON@ORACLE.COM>
To: Jim Mattson <jmattson@google.com>
Cc: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"kvm list" <kvm@vger.kernel.org>,
	"Wanpeng Li" <wanpeng.li@hotmail.com>,
	"Idan Brown" <idan.brown@ORACLE.COM>,
	"Konrad Rzeszutek Wilk" <konrad.wilk@ORACLE.COM>
Subject: Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
Date: Tue, 28 Nov 2017 00:42:00 +0200	[thread overview]
Message-ID: <5A1C94B8.8040809@ORACLE.COM> (raw)
In-Reply-To: <CALMp9eTPi=ovCd8nyMtRR+MaHN_TvTVCTjKbcNoZ0Juipgd4CA@mail.gmail.com>

On 27/11/17 22:48, Jim Mattson wrote:
> I am concerned about the possible conflation of L1 and L2 events.
> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
> is always associated with an L1 event, yet inject_pending_event()
> reads as though vcpu->arch.exception.injected is associated with the
> current level. That seems odd, to split this structure that way.

I think you misunderstand code (rightfully as it is confusing...). 
vmx_check_nested_events() treats exception.pending as related to 
current-level (L1 or L2). Specifically, because 
vmx_check_nested_events() is always called when 
is_guest_mode(vcpu)==true, then it treats it as a "pending L2 
exception". This is also why it checks if L1 intercept the "pending L2 
exception" and if yes, perform an exit from L2 to L1.

I understand the structures in the following way:
1. "injected" status is always associated with the current-level.
(Same for interrupt.pending because it is actually interrupt.injected, 
see comment below marked with (*))
2. exception.pending is associated with current-level.
3. nmi_pending is associated with L1.
4. In general, all the exception/nmi/interrupt structures are "supposed" 
to follow the same convention:
(a) A "pending" status means that the event should be injected to guest 
but it's side-effects have not occurred yet (not yet injected to guest).
(b) An "injected" status means that the event was already "processed" 
and therefore injected to guest along with it's side-effects.

(*) Currently, interrupt.pending is a confusing wrong name. It actually 
represents "interrupt.injected". This is why I made the a patch which 
renames this variable:
("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
https://marc.info/?l=kvm&m=151127826204179&w=2
(BTW, the reason why interrupt doesn't have an "interrupt.pending" flag 
is because it's "pending" status can be considered to be held in LAPIC 
IRR. This is why for example vmx_check_nested_events() call 
kvm_cpu_has_interrupt() to determine if there is a "pending" L1 
interrupt that may cause exit from L2 to L1).

>
> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
> interrupt is interrupted by another event. But
> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
> !lapic_in_kernel(). Clearly, there is some disagreement here over what
> vcpu->arch.interrupt.pending means.

Again, there is some confusion here.

__vmx_complete_interrupts() only re-queues an "injected" event. (Either 
due to exit during event-delivery and therefore valid IDT-vectoring-info 
or because of injection-cancellation in vcpu_enter_guest()).
Therefore, this function should only re-insert events in "injected" 
state. As it really does:
1. NMI is re-injected with nmi_injected status.
2. exception is injected using kvm_requeue_exception() which will mark 
exception.injected=true.
3. interrupt is injected using kvm_queue_interrupt() which indeed mark 
interrupt.pending. But as stated above, this is just a wrong name and it 
is actually "interrupt.injected".

>
> I think there may be some more fundamental problems lurking here.
Yep. You are indeed right.
We have found several more issues revolving treatments of pending events 
in regard to nested-virtualization.
I am about to post another patch series which handles some such cases 
which relates to nested-posted-interrupts handling. Stay tuned. Your 
review will be extremely valuable :)

Regards,
-Liran

>
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>> reinject before next resume of L2.
>>
>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>> to the L1 vCPU which currently runs L2 and exited to L0.
>>
>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
>> it will note that a previous event was re-injected to L2 (by
>> IDT-vectoring-info) and therefore won't check if there are pending L1
>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>> immediate VMExit even though there are pending L1 events!
>>
>> This commit fixes the issue by making sure to check for L1 pending
>> events even if a previous event was reinjected to L2 and bailing out
>> from inject_pending_event() before evaluating a new pending event in
>> case an event was already reinjected.
>>
>> The bug was observed by the following setup:
>> * L0 is a 64CPU machine which runs KVM.
>> * L1 is a 16CPU machine which runs KVM.
>> * L0 & L1 runs with APICv disabled.
>> (Also reproduced with APICv enabled but easier to analyze below info
>> with APICv disabled)
>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>> that he has sent for another L1 vCPU. And all other L1 vCPUs are
>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>> forever (as L1 runs with kernel-preemption disabled).
>>
>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>> series of events:
>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>> vec 252 (Fixed|edge)
>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>> rip 0xffffe00069202690 info 83 0
>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>> ext_int: 0x00000000 ext_int_err: 0x00000000
>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>
>> Which can be analyzed as follows:
>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>> a pending-interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
>> notes that interrupt 0xd2 was reinjected and therefore calls
>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>> before calling inject_pending_event().
>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>> L1 event (The IPI pending in LAPIC's IRR).
>>
>> We have already reached the buggy scenario but events could be
>> furthered analyzed:
>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>> event-delivery.
>> (7) L0 decides to forward the VMExit to L1 for further handling.
>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>> into L1!
>>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a4b5a013064b..63359ab0e798 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>          int r;
>>
>>          /* try to reinject previous events if any */
>> -       if (vcpu->arch.exception.injected) {
>> -               kvm_x86_ops->queue_exception(vcpu);
>> -               return 0;
>> -       }
>>
>> +       if (vcpu->arch.exception.injected)
>> +               kvm_x86_ops->queue_exception(vcpu);
>>          /*
>>           * NMI/interrupt must not be injected if an exception is
>>           * pending, because the latter may have been queued by
>>           * handling exit due to NMI/interrupt delivery.
>>           */
>> -       if (!vcpu->arch.exception.pending) {
>> -               if (vcpu->arch.nmi_injected) {
>> +       else if (!vcpu->arch.exception.pending) {
>> +               if (vcpu->arch.nmi_injected)
>>                          kvm_x86_ops->set_nmi(vcpu);
>> -                       return 0;
>> -               }
>> -
>> -               if (vcpu->arch.interrupt.injected) {
>> +               else if (vcpu->arch.interrupt.injected)
>>                          kvm_x86_ops->set_irq(vcpu);
>> -                       return 0;
>> -               }
>>          }
>>
>> +       /*
>> +        * Call check_nested_events() even if we reinjected a previous event
>> +        * in order for caller to determine if it should require immediate-exit
>> +        * from L2 to L1 due to pending L1 events which require exit
>> +        * from L2 to L1.
>> +        */
>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>                  if (r != 0)
>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                                          vcpu->arch.exception.has_error_code,
>>                                          vcpu->arch.exception.error_code);
>>
>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>                  vcpu->arch.exception.pending = false;
>>                  vcpu->arch.exception.injected = true;
>>
>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                  }
>>
>>                  kvm_x86_ops->queue_exception(vcpu);
>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>> +       }
>> +
>> +       /* Don't consider new event if we re-injected an event */
>> +       if (kvm_event_needs_reinjection(vcpu))
>> +               return 0;
>> +
>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>                  vcpu->arch.smi_pending = false;
>>                  enter_smm(vcpu);
>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>> --
>> 1.9.1
>>