All of lore.kernel.org
 help / color / mirror / Atom feed
From: Liran Alon <LIRAN.ALON@ORACLE.COM>
To: Jim Mattson <jmattson@google.com>
Cc: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"kvm list" <kvm@vger.kernel.org>,
	"Wanpeng Li" <wanpeng.li@hotmail.com>,
	"Idan Brown" <idan.brown@ORACLE.COM>,
	"Konrad Rzeszutek Wilk" <konrad.wilk@ORACLE.COM>
Subject: Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
Date: Tue, 28 Nov 2017 00:42:00 +0200	[thread overview]
Message-ID: <5A1C94B8.8040809@ORACLE.COM> (raw)
In-Reply-To: <CALMp9eTPi=ovCd8nyMtRR+MaHN_TvTVCTjKbcNoZ0Juipgd4CA@mail.gmail.com>



On 27/11/17 22:48, Jim Mattson wrote:
> I am concerned about the possible conflation of L1 and L2 events.
> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
> is always associated with an L1 event, yet inject_pending_event()
> reads as though vcpu->arch.exception.injected is associated with the
> current level. That seems odd, to split this structure that way.

I think you misunderstand code (rightfully as it is confusing...). 
vmx_check_nested_events() treats exception.pending as related to 
current-level (L1 or L2). Specifically, because 
vmx_check_nested_events() is always called when 
is_guest_mode(vcpu)==true, then it treats it as a "pending L2 
exception". This is also why it checks if L1 intercept the "pending L2 
exception" and if yes, perform an exit from L2 to L1.

I understand the structures in the following way:
1. "injected" status is always associated with the current-level.
(Same for interrupt.pending because it is actually interrupt.injected, 
see comment below marked with (*))
2. exception.pending is associated with current-level.
3. nmi_pending is associated with L1.
4. In general, all the exception/nmi/interrupt structures are "supposed" 
to follow the same convention:
(a) A "pending" status means that the event should be injected to guest 
but it's side-effects have not occurred yet (not yet injected to guest).
(b) An "injected" status means that the event was already "processed" 
and therefore injected to guest along with it's side-effects.

(*) Currently, interrupt.pending is a confusing wrong name. It actually 
represents "interrupt.injected". This is why I made the a patch which 
renames this variable:
("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
https://marc.info/?l=kvm&m=151127826204179&w=2
(BTW, the reason why interrupt doesn't have an "interrupt.pending" flag 
is because it's "pending" status can be considered to be held in LAPIC 
IRR. This is why for example vmx_check_nested_events() call 
kvm_cpu_has_interrupt() to determine if there is a "pending" L1 
interrupt that may cause exit from L2 to L1).

>
> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
> interrupt is interrupted by another event. But
> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
> !lapic_in_kernel(). Clearly, there is some disagreement here over what
> vcpu->arch.interrupt.pending means.

Again, there is some confusion here.

__vmx_complete_interrupts() only re-queues an "injected" event. (Either 
due to exit during event-delivery and therefore valid IDT-vectoring-info 
or because of injection-cancellation in vcpu_enter_guest()).
Therefore, this function should only re-insert events in "injected" 
state. As it really does:
1. NMI is re-injected with nmi_injected status.
2. exception is injected using kvm_requeue_exception() which will mark 
exception.injected=true.
3. interrupt is injected using kvm_queue_interrupt() which indeed mark 
interrupt.pending. But as stated above, this is just a wrong name and it 
is actually "interrupt.injected".

>
> I think there may be some more fundamental problems lurking here.
Yep. You are indeed right.
We have found several more issues revolving treatments of pending events 
in regard to nested-virtualization.
I am about to post another patch series which handles some such cases 
which relates to nested-posted-interrupts handling. Stay tuned. Your 
review will be extremely valuable :)

Regards,
-Liran

>
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>> reinject before next resume of L2.
>>
>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>> to the L1 vCPU which currently runs L2 and exited to L0.
>>
>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
>> it will note that a previous event was re-injected to L2 (by
>> IDT-vectoring-info) and therefore won't check if there are pending L1
>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>> immediate VMExit even though there are pending L1 events!
>>
>> This commit fixes the issue by making sure to check for L1 pending
>> events even if a previous event was reinjected to L2 and bailing out
>> from inject_pending_event() before evaluating a new pending event in
>> case an event was already reinjected.
>>
>> The bug was observed by the following setup:
>> * L0 is a 64CPU machine which runs KVM.
>> * L1 is a 16CPU machine which runs KVM.
>> * L0 & L1 runs with APICv disabled.
>> (Also reproduced with APICv enabled but easier to analyze below info
>> with APICv disabled)
>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>> that he has sent for another L1 vCPU. And all other L1 vCPUs are
>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>> forever (as L1 runs with kernel-preemption disabled).
>>
>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>> series of events:
>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>> vec 252 (Fixed|edge)
>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>> rip 0xffffe00069202690 info 83 0
>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>> ext_int: 0x00000000 ext_int_err: 0x00000000
>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>
>> Which can be analyzed as follows:
>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>> a pending-interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
>> notes that interrupt 0xd2 was reinjected and therefore calls
>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>> before calling inject_pending_event().
>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>> L1 event (The IPI pending in LAPIC's IRR).
>>
>> We have already reached the buggy scenario but events could be
>> furthered analyzed:
>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>> event-delivery.
>> (7) L0 decides to forward the VMExit to L1 for further handling.
>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>> into L1!
>>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a4b5a013064b..63359ab0e798 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>          int r;
>>
>>          /* try to reinject previous events if any */
>> -       if (vcpu->arch.exception.injected) {
>> -               kvm_x86_ops->queue_exception(vcpu);
>> -               return 0;
>> -       }
>>
>> +       if (vcpu->arch.exception.injected)
>> +               kvm_x86_ops->queue_exception(vcpu);
>>          /*
>>           * NMI/interrupt must not be injected if an exception is
>>           * pending, because the latter may have been queued by
>>           * handling exit due to NMI/interrupt delivery.
>>           */
>> -       if (!vcpu->arch.exception.pending) {
>> -               if (vcpu->arch.nmi_injected) {
>> +       else if (!vcpu->arch.exception.pending) {
>> +               if (vcpu->arch.nmi_injected)
>>                          kvm_x86_ops->set_nmi(vcpu);
>> -                       return 0;
>> -               }
>> -
>> -               if (vcpu->arch.interrupt.injected) {
>> +               else if (vcpu->arch.interrupt.injected)
>>                          kvm_x86_ops->set_irq(vcpu);
>> -                       return 0;
>> -               }
>>          }
>>
>> +       /*
>> +        * Call check_nested_events() even if we reinjected a previous event
>> +        * in order for caller to determine if it should require immediate-exit
>> +        * from L2 to L1 due to pending L1 events which require exit
>> +        * from L2 to L1.
>> +        */
>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>                  if (r != 0)
>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                                          vcpu->arch.exception.has_error_code,
>>                                          vcpu->arch.exception.error_code);
>>
>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>                  vcpu->arch.exception.pending = false;
>>                  vcpu->arch.exception.injected = true;
>>
>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                  }
>>
>>                  kvm_x86_ops->queue_exception(vcpu);
>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>> +       }
>> +
>> +       /* Don't consider new event if we re-injected an event */
>> +       if (kvm_event_needs_reinjection(vcpu))
>> +               return 0;
>> +
>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>                  vcpu->arch.smi_pending = false;
>>                  enter_smm(vcpu);
>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>> --
>> 1.9.1
>>

  reply	other threads:[~2017-11-27 22:42 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
2017-12-01 23:45   ` Jim Mattson
2017-12-02  0:19     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected Liran Alon
2017-11-28 17:02   ` Jim Mattson
2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
2017-11-22 20:25   ` Radim Krčmář
2017-11-23 18:20     ` Liran Alon
2017-11-22 21:00   ` Jim Mattson
2017-11-23 18:45     ` Liran Alon
2017-11-23 23:05       ` Paolo Bonzini
2017-11-27 17:26       ` Jim Mattson
2017-11-27 18:30         ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt Liran Alon
2017-11-22 20:34   ` Radim Krčmář
2017-11-22 22:27     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 5/8] Revert "kvm: nVMX: Disallow userspace-injected exceptions in guest mode" Liran Alon
2017-11-21 15:30 ` [PATCH v2 6/8] KVM: x86: Fix misleading comments on handling pending exceptions Liran Alon
2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
2017-11-27 20:48   ` Jim Mattson
2017-11-27 22:42     ` Liran Alon [this message]
2017-11-28  4:55       ` Jim Mattson
2017-11-28 11:14         ` Paolo Bonzini
2017-11-28 13:59           ` Liran Alon
2017-11-28 11:36         ` Liran Alon
2017-11-28  6:39   ` Jim Mattson
2017-11-28 18:26     ` Jim Mattson
2017-11-28 19:45       ` Liran Alon
2017-11-28 21:04         ` Jim Mattson
2017-11-28 19:33     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 8/8] KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending Liran Alon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5A1C94B8.8040809@ORACLE.COM \
    --to=liran.alon@oracle.com \
    --cc=idan.brown@ORACLE.COM \
    --cc=jmattson@google.com \
    --cc=konrad.wilk@ORACLE.COM \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=wanpeng.li@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.