From: Liran Alon
Subject: Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
Date: Tue, 28 Nov 2017 00:42:00 +0200
Message-ID: <5A1C94B8.8040809@ORACLE.COM>
References: <1511278211-12257-1-git-send-email-liran.alon@oracle.com>
 <1511278211-12257-8-git-send-email-liran.alon@oracle.com>
To: Jim Mattson
Cc: Paolo Bonzini, Radim Krčmář, kvm list, Wanpeng Li, Idan Brown,
 Konrad Rzeszutek Wilk

On 27/11/17 22:48, Jim Mattson wrote:
> I am concerned about the possible conflation of L1 and L2 events.
> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
> is always associated with an L1 event, yet inject_pending_event()
> reads as though vcpu->arch.exception.injected is associated with the
> current level. That seems odd, to split this structure that way.

I think you misunderstand the code (understandably, as it is confusing...).

vmx_check_nested_events() treats exception.pending as related to the
current level (L1 or L2). Specifically, because vmx_check_nested_events()
is always called when is_guest_mode(vcpu)==true, it treats it as a
"pending L2 exception". This is also why it checks whether L1 intercepts
the "pending L2 exception" and, if so, performs an exit from L2 to L1.

I understand the structures in the following way:
1. "injected" status is always associated with the current level.
   (The same goes for interrupt.pending, because it is actually
   interrupt.injected; see the comment below marked with (*).)
2. exception.pending is associated with the current level.
3. nmi_pending is associated with L1.
4. In general, all the exception/nmi/interrupt structures are supposed to
   follow the same convention:
   (a) A "pending" status means that the event should be injected into the
       guest but its side-effects have not occurred yet (it was not yet
       injected into the guest).
   (b) An "injected" status means that the event was already "processed"
       and therefore injected into the guest along with its side-effects.

(*) Currently, interrupt.pending is a confusing and wrong name. It actually
represents "interrupt.injected". This is why I made a patch which renames
this variable:
("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
https://marc.info/?l=kvm&m=151127826204179&w=2
(BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is
that its "pending" status can be considered to be held in the LAPIC IRR.
This is why, for example, vmx_check_nested_events() calls
kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt
that may cause an exit from L2 to L1.)

>
> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
> interrupt is interrupted by another event. But
> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
> !lapic_in_kernel(). Clearly, there is some disagreement here over what
> vcpu->arch.interrupt.pending means.

Again, there is some confusion here.
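Before going into __vmx_complete_interrupts(), here is the convention from
(4) above restated as an abridged, annotated sketch of the relevant
bookkeeping in struct kvm_vcpu_arch. The field names exist in
arch/x86/include/asm/kvm_host.h; the ordering is approximate and the
comments are only my reading of the intended semantics, not authoritative:

        struct kvm_queued_exception {
                bool pending;      /* (a) queued; side-effects not applied yet */
                bool injected;     /* (b) side-effects applied; programmed for delivery */
                u8 nr;
                bool has_error_code;
                u32 error_code;
        } exception;               /* current level (L1 or L2) */

        struct kvm_queued_interrupt {
                bool pending;      /* misnamed: really "injected" (see patch 2/8) */
                bool soft;
                u8 nr;
        } interrupt;               /* current level; true "pending" lives in the LAPIC IRR */

        unsigned nmi_pending;      /* L1: queued NMIs, side-effects not applied yet */
        bool nmi_injected;         /* current level: NMI programmed for delivery */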
__vmx_complete_interrupts() only re-queues an "injected" event (either
because of an exit during event-delivery, and therefore a valid
IDT-vectoring-info field, or because of injection cancellation in
vcpu_enter_guest()). Therefore, this function should only re-insert events
in "injected" state. And it really does:
1. An NMI is re-injected with nmi_injected status.
2. An exception is re-injected using kvm_requeue_exception(), which marks
   exception.injected=true.
3. An interrupt is re-injected using kvm_queue_interrupt(), which indeed
   marks interrupt.pending. But as stated above, this is just a misleading
   name; it really means "interrupt.injected".
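To make that concrete, here is a heavily abridged sketch of the re-queue
logic in __vmx_complete_interrupts() (paraphrased from memory of
arch/x86/kvm/vmx.c, not a verbatim quote; instruction-length and
error-code handling elided):

        u8 vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
        u32 type  = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;

        if (!(idt_vectoring_info & VECTORING_INFO_VALID_MASK))
                return;

        kvm_make_request(KVM_REQ_EVENT, vcpu);

        switch (type) {
        case INTR_TYPE_NMI_INTR:
                /* (1) NMI: re-inject via nmi_injected */
                vcpu->arch.nmi_injected = true;
                break;
        case INTR_TYPE_SOFT_EXCEPTION:
        case INTR_TYPE_HARD_EXCEPTION:
                /* (2) exception: marks exception.injected = true */
                kvm_requeue_exception(vcpu, vector);
                break;
        case INTR_TYPE_SOFT_INTR:
        case INTR_TYPE_EXT_INTR:
                /* (3) interrupt: sets interrupt.pending, which really
                 *     means "injected" as explained above */
                kvm_queue_interrupt(vcpu, vector, type == INTR_TYPE_SOFT_INTR);
                break;
        }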
>
> I think there may be some more fundamental problems lurking here.

Yep. You are indeed right. We have found several more issues revolving
around the treatment of pending events with regard to nested
virtualization. I am about to post another patch series which handles some
of these cases related to nested-posted-interrupts handling.
Stay tuned. Your review will be extremely valuable :)

Regards,
-Liran

>
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon wrote:
>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>> reinject before the next resume of L2.
>>
>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>> to the L1 vCPU which currently runs L2 and exited to L0.
>>
>> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
>> it will note that a previous event was re-injected to L2 (by
>> IDT-vectoring-info) and therefore won't check if there are pending L1
>> events which require an exit from L2 to L1. Thus, L0 enters L2 without
>> an immediate VMExit even though there are pending L1 events!
>>
>> This commit fixes the issue by making sure to check for L1 pending
>> events even if a previous event was reinjected to L2, and by bailing out
>> from inject_pending_event() before evaluating a new pending event in
>> case an event was already reinjected.
>>
>> The bug was observed with the following setup:
>> * L0 is a 64-CPU machine which runs KVM.
>> * L1 is a 16-CPU machine which runs KVM.
>> * L0 & L1 run with APICv disabled.
>>   (Also reproduced with APICv enabled, but it is easier to analyze the
>>   info below with APICv disabled.)
>> * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
>> During L2 boot, L1 hangs completely, and analyzing the hang reveals that
>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>> that it has sent to another L1 vCPU. All other L1 vCPUs are
>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>> forever (as L1 runs with kernel preemption disabled).
>>
>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>> series of events:
>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>> vec 252 (Fixed|edge)
>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>> rip 0xffffe00069202690 info 83 0
>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>> ext_int: 0x00000000 ext_int_err: 0x00000000
>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>
>> Which can be analyzed as follows:
>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>> a pending interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>> vCPU 15. This will set the relevant bit in the LAPIC's IRR and set
>> KVM_REQ_EVENT.
>> (3) L0 reaches vcpu_enter_guest() which calls inject_pending_event() which
>> notes that interrupt 0xd2 was reinjected and therefore calls
>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>> before calling inject_pending_event().
>> (4) L0 resumes L2 without an immediate exit even though there is a pending
>> L1 event (the IPI pending in the LAPIC's IRR).
>>
>> We have already reached the buggy scenario, but the events can be
>> analyzed further:
>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during
>> event-delivery.
>> (7) L0 decides to forward the VMExit to L1 for further handling.
>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>> into L1!
>>
>> Signed-off-by: Liran Alon
>> Reviewed-by: Nikita Leshenko
>> Signed-off-by: Konrad Rzeszutek Wilk
>> ---
>>  arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>  1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a4b5a013064b..63359ab0e798 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>          int r;
>>
>>          /* try to reinject previous events if any */
>> -        if (vcpu->arch.exception.injected) {
>> -                kvm_x86_ops->queue_exception(vcpu);
>> -                return 0;
>> -        }
>>
>> +        if (vcpu->arch.exception.injected)
>> +                kvm_x86_ops->queue_exception(vcpu);
>>          /*
>>           * NMI/interrupt must not be injected if an exception is
>>           * pending, because the latter may have been queued by
>>           * handling exit due to NMI/interrupt delivery.
>>           */
>> -        if (!vcpu->arch.exception.pending) {
>> -                if (vcpu->arch.nmi_injected) {
>> +        else if (!vcpu->arch.exception.pending) {
>> +                if (vcpu->arch.nmi_injected)
>>                          kvm_x86_ops->set_nmi(vcpu);
>> -                        return 0;
>> -                }
>> -
>> -                if (vcpu->arch.interrupt.injected) {
>> +                else if (vcpu->arch.interrupt.injected)
>>                          kvm_x86_ops->set_irq(vcpu);
>> -                        return 0;
>> -                }
>>          }
>>
>> +        /*
>> +         * Call check_nested_events() even if we reinjected a previous event
>> +         * in order for caller to determine if it should require immediate-exit
>> +         * from L2 to L1 due to pending L1 events which require exit
>> +         * from L2 to L1.
>> +         */
>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>                  if (r != 0)
>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                                            vcpu->arch.exception.has_error_code,
>>                                            vcpu->arch.exception.error_code);
>>
>> +                WARN_ON_ONCE(vcpu->arch.exception.injected);
>>                  vcpu->arch.exception.pending = false;
>>                  vcpu->arch.exception.injected = true;
>>
>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                  }
>>
>>                  kvm_x86_ops->queue_exception(vcpu);
>> -        } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>> +        }
>> +
>> +        /* Don't consider new event if we re-injected an event */
>> +        if (kvm_event_needs_reinjection(vcpu))
>> +                return 0;
>> +
>> +        if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>> +            kvm_x86_ops->smi_allowed(vcpu)) {
>>                  vcpu->arch.smi_pending = false;
>>                  enter_smm(vcpu);
>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>> --
>> 1.9.1
>>
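For reference, here is the resulting order of operations in
inject_pending_event() after this patch, condensed into a rough sketch
(paraphrased; not a verbatim copy of the patched function):

        /* 1. Re-inject a previously delivered event, if any (no new side-effects). */
        if (vcpu->arch.exception.injected)
                kvm_x86_ops->queue_exception(vcpu);
        else if (!vcpu->arch.exception.pending) {
                if (vcpu->arch.nmi_injected)
                        kvm_x86_ops->set_nmi(vcpu);
                else if (vcpu->arch.interrupt.injected)
                        kvm_x86_ops->set_irq(vcpu);
        }

        /*
         * 2. Even if an event was re-injected above, let L1 be notified:
         *    check_nested_events() may request an immediate-exit so that a
         *    pending L1 event (e.g. the IPI sitting in the LAPIC IRR) is
         *    not lost while L2 runs.
         */
        if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
                r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
                if (r != 0)
                        return r;
        }

        /* 3. A pending exception can still be delivered (it may preempt NMI/IRQ). */
        if (vcpu->arch.exception.pending) {
                /* ... apply side-effects, mark it injected, queue it ... */
        }

        /* 4. Only consider a brand-new SMI/NMI/IRQ if nothing was re-injected. */
        if (kvm_event_needs_reinjection(vcpu))
                return 0;
        /* ... smi_pending / nmi_pending / kvm_cpu_has_injectable_intr() ... */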