From mboxrd@z Thu Jan  1 00:00:00 1970
From: Wanpeng Li <kernellwp@gmail.com>
Subject: Re: [PATCH v2] KVM: nVMX: Fix attempting to emulate "Acknowledge
 interrupt on exit" when there is no interrupt which L1 requires to inject to L2
Date: Wed, 2 Aug 2017 16:05:41 +0800
Message-ID: <CANRm+CwrKv5mbiJcTjzi8=0Yp_5musFCbAm8tSF_FE3kaZPvCg@mail.gmail.com>
References: <1501554327-3608-1-git-send-email-wanpeng.li@hotmail.com>
 <20170801195859.GB1437@flask> <CANRm+Cy3-i237im52PtvaRL-WqQYYPv935P5L1DcbrbeAFLEYQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        kvm <kvm@vger.kernel.org>, Paolo Bonzini <pbonzini@redhat.com>,
        Wanpeng Li <wanpeng.li@hotmail.com>
To: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= <rkrcmar@redhat.com>
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <CANRm+Cy3-i237im52PtvaRL-WqQYYPv935P5L1DcbrbeAFLEYQ@mail.gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

2017-08-02 6:42 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2017-08-02 3:59 GMT+08:00 Radim Kr=C4=8Dm=C3=A1=C5=99 <rkrcmar@redhat.com=
>:
>> 2017-07-31 19:25-0700, Wanpeng Li:
>>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>>
>>> ------------[ cut here ]------------
>>>  WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexi=
t+0xd64/0xd70 [kvm_intel]
>>>  CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
>>>  RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
>>> Call Trace:
>>>   vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
>>>   ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
>>>   kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
>>>   ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
>>>   ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
>>>   kvm_vcpu_ioctl+0x340/0x700 [kvm]
>>>   ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
>>>   ? __fget+0xfc/0x210
>>>   do_vfs_ioctl+0xa4/0x6a0
>>>   ? __fget+0x11d/0x210
>>>   SyS_ioctl+0x79/0x90
>>>   do_syscall_64+0x8f/0x750
>>>   ? trace_hardirqs_on_thunk+0x1a/0x1c
>>>   entry_SYSCALL64_slow_path+0x25/0x25
>>>
>>> This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, =
which
>>> means that tells the kernel to not make use of any IOAPICs that may be =
present
>>> in the system.
>>>
>>> Actually external_intr variable in nested_vmx_vmexit() is the req_int_w=
in
>>> variable passed from vcpu_enter_guest() which means that the L0's users=
pace
>>> requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt=
(vcpu) &&
>>> L0's userspace reqeusts an irq window) is true, so there is no interrup=
t which
>>> L1 requires to inject to L2, we should not attempt to emualte "Acknowle=
dge
>>> interrupt on exit" for the irq window requirement in this scenario.
>>>
>>> This patch fixes it by not attempt to emulate "Acknowledge interrupt on=
 exit"
>>> if there is no L1 requirement to inject an interrupt to L2.
>>>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Kr=C4=8Dm=C3=A1=C5=99 <rkrcmar@redhat.com>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>>> ---
>>> v1 -> v2:
>>>  * update patch description
>>>  * check nested_exit_intr_ack_set() first
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> @@ -11118,8 +11118,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *=
vcpu, u32 exit_reason,
>>>
>>>       vmx_switch_vmcs(vcpu, &vmx->vmcs01);
>>>
>>> -     if ((exit_reason =3D=3D EXIT_REASON_EXTERNAL_INTERRUPT)
>>> -         && nested_exit_intr_ack_set(vcpu)) {
>>> +     if (nested_exit_intr_ack_set(vcpu) &&
>>> +             exit_reason =3D=3D EXIT_REASON_EXTERNAL_INTERRUPT &&
>>> +             kvm_cpu_has_interrupt(vcpu)) {
>>
>> This would work as a solution, but I don't think it's the correct
>> behavior.
>>
>> SDM says that with acknowledge interrupt on exit, bit 31 of the VM-exit
>> interrupt information (valid interrupt) is always set to 1 on
>> EXIT_REASON_EXTERNAL_INTERRUPT.  We don't want to break hypervisors
>> expecting an interrupt in that case, so we should do a userspace VM exit
>> when the window is open and then inject the userspace interrupt with a
>> VM exit.
>
> Agreed.
>
>>
>> The simplest thing that came to my mind is to:
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 39a6222bf968..9ad0c882c4f5 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -10687,7 +10687,8 @@ static int vmx_check_nested_events(struct kvm_vc=
pu *vcpu, bool external_intr)
>>                 return 0;
>>         }
>>
>> -       if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
>> +       if ((kvm_cpu_has_interrupt(vcpu) ||
>> +            (external_intr && !nested_exit_intr_ack_set(vcpu))) &&
>>             nested_exit_on_intr(vcpu)) {
>>                 if (vmx->nested.nested_run_pending)
>>                         return -EBUSY;
>>
>
> Agreed.

What's your opinion, Paolo? :) Actually I considered the above idea
before, it is what SDM defined.

Regards,
Wanpeng Li

>
>> but I think it could break more ... actually, why was the window closed?
>>
>> kvm_vcpu_ready_for_interrupt_injection() checks vmx_interrupt_allowed()
>> in order to decide need for the window, but vmx_check_nested_events()
>> doesn't care about that at all, so the window might just appear closed.
>> Would the following hunk help too?
>
> In addition, the request window can be requested by L0's userspace
> (kvm_arch_pre_run), and the idea below still can't fix in my testing.
>
> Regards,
> Wanpeng Li
>
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 39a6222bf968..7e6caa9c225d 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -5567,8 +5567,10 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
>>
>>  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>>  {
>> -       return (!to_vmx(vcpu)->nested.nested_run_pending &&
>> -               vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>> +       if (is_guest_mode(vcpu))
>> +               return !to_vmx(vcpu)->nested.nested_run_pending;
>> +
>> +       return vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF &&
>>                 !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>>                         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)=
);
>>  }
>>
>> (It doesn't prevent malicious userspace from hitting the WARN, though.)
>>
>> Thanks.