KVM: x86: inject exceptions produced by x86_decode_insn

Message ID 1510307378-97452-1-git-send-email-pbonzini@redhat.com
State New, archived

Commit Message

Paolo Bonzini Nov. 10, 2017, 9:49 a.m. UTC
Sometimes, a processor might execute an instruction while another
processor is updating the page tables for that instruction's code page,
but before the TLB shootdown completes.  The interesting case happens
if the page is in the TLB.

In general, the processor will succeed in executing the instruction and
nothing bad happens.  However, what if the instruction is an MMIO access?
If *that* happens, KVM invokes the emulator, and the emulator gets the
updated page tables.  If the update side had marked the code page as non
present, the page table walk then will fail and so will x86_decode_insn.

Unfortunately, even though kvm_fetch_guest_virt is correctly returning
X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
a fatal error if the instruction cannot simply be reexecuted (as is the
case for MMIO).  And this in fact happened sometimes when rebooting
Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
the exception if true is enough to fix the case.

Thanks to Eduardo Habkost for helping in the debugging of this issue.

Reported-by: Yanan Fu <yfu@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c | 2 ++
 1 file changed, 2 insertions(+)
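
The race described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct toy_mmu` and every function below are hypothetical names invented for illustration, not KVM or hardware interfaces. It shows why the CPU's own fetch can succeed from a stale TLB entry while a software walk of the updated page tables fails.

```c
#include <stdbool.h>

/* Hypothetical toy state: one PTE and one cached translation for a code page. */
struct toy_mmu {
	bool pte_present;	/* current state of the in-memory page table */
	bool tlb_valid;		/* translation still cached by the executing CPU */
};

/* Hardware fetch: a valid TLB entry is used without re-walking the tables. */
bool toy_cpu_fetch_ok(const struct toy_mmu *m)
{
	return m->tlb_valid || m->pte_present;
}

/* Emulator fetch: software always walks the current page tables. */
bool toy_emulator_fetch_ok(const struct toy_mmu *m)
{
	return m->pte_present;
}

/* Another CPU marks the page non-present; the shootdown has not arrived yet. */
void toy_remote_unmap(struct toy_mmu *m)
{
	m->pte_present = false;
}

/* The TLB shootdown finally reaches the executing CPU. */
void toy_shootdown(struct toy_mmu *m)
{
	m->tlb_valid = false;
}
```

Between `toy_remote_unmap()` and `toy_shootdown()`, the hardware fetch still succeeds but the emulator's fetch does not, which is the window in which an MMIO instruction ends up in the emulator with an unreadable code page.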

Comments

Radim Krčmář Nov. 10, 2017, 9:42 p.m. UTC | #1
2017-11-10 10:49+0100, Paolo Bonzini:
> Sometimes, a processor might execute an instruction while another
> processor is updating the page tables for that instruction's code page,
> but before the TLB shootdown completes.  The interesting case happens
> if the page is in the TLB.
> 
> In general, the processor will succeed in executing the instruction and
> nothing bad happens.  However, what if the instruction is an MMIO access?
> If *that* happens, KVM invokes the emulator, and the emulator gets the
> updated page tables.  If the update side had marked the code page as non
> present, the page table walk then will fail and so will x86_decode_insn.
> 
> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> a fatal error if the instruction cannot simply be reexecuted (as is the
> case for MMIO).  And this in fact happened sometimes when rebooting
> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> the exception if true is enough to fix the case.
> 
> Thanks to Eduardo Habkost for helping in the debugging of this issue.
> 
> Reported-by: Yanan Fu <yfu@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Applied, thanks.
Wanpeng Li Nov. 13, 2017, 7:15 a.m. UTC | #2
2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> Sometimes, a processor might execute an instruction while another
> processor is updating the page tables for that instruction's code page,
> but before the TLB shootdown completes.  The interesting case happens
> if the page is in the TLB.
>
> In general, the processor will succeed in executing the instruction and
> nothing bad happens.  However, what if the instruction is an MMIO access?
> If *that* happens, KVM invokes the emulator, and the emulator gets the
> updated page tables.  If the update side had marked the code page as non
> present, the page table walk then will fail and so will x86_decode_insn.
>
> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> a fatal error if the instruction cannot simply be reexecuted (as is the
> case for MMIO).  And this in fact happened sometimes when rebooting
> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> the exception if true is enough to fix the case.

I found the only place which can set ctxt->have_exception is in the
function x86_emulate_insn(), and x86_decode_insn() will not set
ctxt->have_exception even if kvm_fetch_guest_virt() returns
X86EMUL_PROPAGATE_FAULT.

Regards,
Wanpeng Li
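
The observation above can be sketched as a small model of the patched failure path. Everything here is hypothetical (the `toy_` names, the two boolean inputs, and the assumption that the flag is cleared before decoding); it only mirrors the shape of the hunk quoted below, not the real KVM code. If the decode step reports a fault without ever setting the flag, the added check can never fire.

```c
#include <stdbool.h>

enum toy_emul_rc { TOY_CONTINUE, TOY_PROPAGATE_FAULT };
enum toy_result { TOY_DONE, TOY_FAIL };

/* Hypothetical stand-in for the emulation context. */
struct toy_ctxt {
	bool have_exception;
};

/* Models a decode whose instruction fetch faults: an error code is
 * returned, but have_exception is left untouched. */
enum toy_emul_rc toy_decode(struct toy_ctxt *ctxt, bool fetch_faults)
{
	(void)ctxt;
	return fetch_faults ? TOY_PROPAGATE_FAULT : TOY_CONTINUE;
}

/* Models the decode-failure path of the caller with the patch applied. */
enum toy_result toy_emulate(struct toy_ctxt *ctxt, bool fetch_faults,
			    bool can_reexecute)
{
	ctxt->have_exception = false;	/* assumed reset before decoding */

	if (toy_decode(ctxt, fetch_faults) != TOY_CONTINUE) {
		if (can_reexecute)
			return TOY_DONE;
		if (ctxt->have_exception)	/* the check added by the patch */
			return TOY_DONE;	/* the fault would be injected here */
		return TOY_FAIL;		/* reported as an emulation failure */
	}
	return TOY_DONE;
}
```

In this model, a faulting fetch on an instruction that cannot be re-executed still ends in `TOY_FAIL`, so the two added lines change nothing unless the decode path also records the exception.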

>
> Thanks to Eduardo Habkost for helping in the debugging of this issue.
>
> Reported-by: Yanan Fu <yfu@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34c85aa2e2d1..6dbed9022797 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5722,6 +5722,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>                         if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
>                                                 emulation_type))
>                                 return EMULATE_DONE;
> +                       if (ctxt->have_exception && inject_emulated_exception(vcpu))
> +                               return EMULATE_DONE;
>                         if (emulation_type & EMULTYPE_SKIP)
>                                 return EMULATE_FAIL;
>                         return handle_emulation_failure(vcpu);
> --
> 1.8.3.1
>
Paolo Bonzini Nov. 13, 2017, 8:32 a.m. UTC | #3
On 13/11/2017 08:15, Wanpeng Li wrote:
> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> Sometimes, a processor might execute an instruction while another
>> processor is updating the page tables for that instruction's code page,
>> but before the TLB shootdown completes.  The interesting case happens
>> if the page is in the TLB.
>>
>> In general, the processor will succeed in executing the instruction and
>> nothing bad happens.  However, what if the instruction is an MMIO access?
>> If *that* happens, KVM invokes the emulator, and the emulator gets the
>> updated page tables.  If the update side had marked the code page as non
>> present, the page table walk then will fail and so will x86_decode_insn.
>>
>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
>> a fatal error if the instruction cannot simply be reexecuted (as is the
>> case for MMIO).  And this in fact happened sometimes when rebooting
>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
>> the exception if true is enough to fix the case.
> 
> I found the only place which can set ctxt->have_exception is in the
> function x86_emulate_insn(), and x86_decode_insn() will not set
> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> X86EMUL_PROPAGATE_FAULT.

Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
this patch! :(

Yanan, can you double check that you can reproduce the issue with an
unpatched kernel?  I will work on a kvm-unit-tests testcase.

Paolo

> Regards,
> Wanpeng Li
> 
>>
>> Thanks to Eduardo Habkost for helping in the debugging of this issue.
>>
>> Reported-by: Yanan Fu <yfu@redhat.com>
>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  arch/x86/kvm/x86.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 34c85aa2e2d1..6dbed9022797 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5722,6 +5722,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>>                         if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
>>                                                 emulation_type))
>>                                 return EMULATE_DONE;
>> +                       if (ctxt->have_exception && inject_emulated_exception(vcpu))
>> +                               return EMULATE_DONE;
>>                         if (emulation_type & EMULTYPE_SKIP)
>>                                 return EMULATE_FAIL;
>>                         return handle_emulation_failure(vcpu);
>> --
>> 1.8.3.1
>>
Yanan Fu Nov. 13, 2017, 10:09 a.m. UTC | #4
----- Original Message -----
> From: "Paolo Bonzini" <pbonzini@redhat.com>
> To: "Wanpeng Li" <kernellwp@gmail.com>
> Cc: linux-kernel@vger.kernel.org, "kvm" <kvm@vger.kernel.org>, yfu@redhat.com, "Eduardo Habkost"
> <ehabkost@redhat.com>
> Sent: Monday, November 13, 2017 4:32:09 PM
> Subject: Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn
> 
> On 13/11/2017 08:15, Wanpeng Li wrote:
> > 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> >> Sometimes, a processor might execute an instruction while another
> >> processor is updating the page tables for that instruction's code page,
> >> but before the TLB shootdown completes.  The interesting case happens
> >> if the page is in the TLB.
> >>
> >> In general, the processor will succeed in executing the instruction and
> >> nothing bad happens.  However, what if the instruction is an MMIO access?
> >> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >> updated page tables.  If the update side had marked the code page as non
> >> present, the page table walk then will fail and so will x86_decode_insn.
> >>
> >> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >> a fatal error if the instruction cannot simply be reexecuted (as is the
> >> case for MMIO).  And this in fact happened sometimes when rebooting
> >> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> >> the exception if true is enough to fix the case.
> > 
> > I found the only place which can set ctxt->have_exception is in the
> > function x86_emulate_insn(), and x86_decode_insn() will not set
> > ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > X86EMUL_PROPAGATE_FAULT.
> 
> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> this patch! :(
> 
> Yanan, can you double check that you can reproduce the issue with an
> unpatched kernel?  I will work on a kvm-unit-tests testcase.

Hi Paolo,
Yes, I can still reproduce it. In the latest acceptance testing, which I just
finished this afternoon, 7 cases failed because of this problem (all with
win2012.r2 guests).

And with the scratch build provided in bz 1493501, I repeated the test 30
times and it passed every time. Thanks!


Best Wishes
Yanan Fu

> 
> Paolo
> 
> > Regards,
> > Wanpeng Li
> > 
> >>
> >> Thanks to Eduardo Habkost for helping in the debugging of this issue.
> >>
> >> Reported-by: Yanan Fu <yfu@redhat.com>
> >> Cc: Eduardo Habkost <ehabkost@redhat.com>
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >>  arch/x86/kvm/x86.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 34c85aa2e2d1..6dbed9022797 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -5722,6 +5722,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
> >>                         if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
> >>                                                 emulation_type))
> >>                                 return EMULATE_DONE;
> >> +                       if (ctxt->have_exception && inject_emulated_exception(vcpu))
> >> +                               return EMULATE_DONE;
> >>                         if (emulation_type & EMULTYPE_SKIP)
> >>                                 return EMULATE_FAIL;
> >>                         return handle_emulation_failure(vcpu);
> >> --
> >> 1.8.3.1
> >>
> 
>
Radim Krčmář Nov. 16, 2017, 5:12 p.m. UTC | #5
2017-11-13 09:32+0100, Paolo Bonzini:
> On 13/11/2017 08:15, Wanpeng Li wrote:
> > 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> >> Sometimes, a processor might execute an instruction while another
> >> processor is updating the page tables for that instruction's code page,
> >> but before the TLB shootdown completes.  The interesting case happens
> >> if the page is in the TLB.
> >>
> >> In general, the processor will succeed in executing the instruction and
> >> nothing bad happens.  However, what if the instruction is an MMIO access?
> >> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >> updated page tables.  If the update side had marked the code page as non
> >> present, the page table walk then will fail and so will x86_decode_insn.
> >>
> >> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >> a fatal error if the instruction cannot simply be reexecuted (as is the
> >> case for MMIO).  And this in fact happened sometimes when rebooting
> >> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> >> the exception if true is enough to fix the case.
> > 
> > I found the only place which can set ctxt->have_exception is in the
> > function x86_emulate_insn(), and x86_decode_insn() will not set
> > ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > X86EMUL_PROPAGATE_FAULT.
> 
> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> this patch! :(

I have dropped this patch in the meantime.
Eduardo Habkost Nov. 29, 2017, 11:44 a.m. UTC | #6
On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> On 13/11/2017 08:15, Wanpeng Li wrote:
> > 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> >> Sometimes, a processor might execute an instruction while another
> >> processor is updating the page tables for that instruction's code page,
> >> but before the TLB shootdown completes.  The interesting case happens
> >> if the page is in the TLB.
> >>
> >> In general, the processor will succeed in executing the instruction and
> >> nothing bad happens.  However, what if the instruction is an MMIO access?
> >> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >> updated page tables.  If the update side had marked the code page as non
> >> present, the page table walk then will fail and so will x86_decode_insn.
> >>
> >> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >> a fatal error if the instruction cannot simply be reexecuted (as is the
> >> case for MMIO).  And this in fact happened sometimes when rebooting
> >> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> >> the exception if true is enough to fix the case.
> > 
> > I found the only place which can set ctxt->have_exception is in the
> > function x86_emulate_insn(), and x86_decode_insn() will not set
> > ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > X86EMUL_PROPAGATE_FAULT.
> 
> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> this patch! :(
> 
> Yanan, can you double check that you can reproduce the issue with an
> unpatched kernel?  I will work on a kvm-unit-tests testcase.

We don't have a kvm-unit-tests reproducer for this yet, right?

I'm considering trying to write one, but I don't want to
duplicate work.
Paolo Bonzini Nov. 29, 2017, 11:44 a.m. UTC | #7
On 29/11/2017 12:44, Eduardo Habkost wrote:
> On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
>> On 13/11/2017 08:15, Wanpeng Li wrote:
>>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>>> Sometimes, a processor might execute an instruction while another
>>>> processor is updating the page tables for that instruction's code page,
>>>> but before the TLB shootdown completes.  The interesting case happens
>>>> if the page is in the TLB.
>>>>
>>>> In general, the processor will succeed in executing the instruction and
>>>> nothing bad happens.  However, what if the instruction is an MMIO access?
>>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
>>>> updated page tables.  If the update side had marked the code page as non
>>>> present, the page table walk then will fail and so will x86_decode_insn.
>>>>
>>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
>>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
>>>> a fatal error if the instruction cannot simply be reexecuted (as is the
>>>> case for MMIO).  And this in fact happened sometimes when rebooting
>>>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
>>>> the exception if true is enough to fix the case.
>>>
>>> I found the only place which can set ctxt->have_exception is in the
>>> function x86_emulate_insn(), and x86_decode_insn() will not set
>>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
>>> X86EMUL_PROPAGATE_FAULT.
>>
>> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
>> this patch! :(
>>
>> Yanan, can you double check that you can reproduce the issue with an
>> unpatched kernel?  I will work on a kvm-unit-tests testcase.
> 
> We don't have a kvm-unit-tests reproducer for this yet, right?
> 
> I'm considering trying to write one, but I don't want to
> duplicate work.

No, I haven't written one yet.

Paolo
Eduardo Habkost Nov. 29, 2017, 6:42 p.m. UTC | #8
On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
> On 29/11/2017 12:44, Eduardo Habkost wrote:
> > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> >> On 13/11/2017 08:15, Wanpeng Li wrote:
> >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> >>>> Sometimes, a processor might execute an instruction while another
> >>>> processor is updating the page tables for that instruction's code page,
> >>>> but before the TLB shootdown completes.  The interesting case happens
> >>>> if the page is in the TLB.
> >>>>
> >>>> In general, the processor will succeed in executing the instruction and
> >>>> nothing bad happens.  However, what if the instruction is an MMIO access?
> >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >>>> updated page tables.  If the update side had marked the code page as non
> >>>> present, the page table walk then will fail and so will x86_decode_insn.
> >>>>
> >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
> >>>> case for MMIO).  And this in fact happened sometimes when rebooting
> >>>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> >>>> the exception if true is enough to fix the case.
> >>>
> >>> I found the only place which can set ctxt->have_exception is in the
> >>> function x86_emulate_insn(), and x86_decode_insn() will not set
> >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> >>> X86EMUL_PROPAGATE_FAULT.
> >>
> >> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> >> this patch! :(
> >>
> >> Yanan, can you double check that you can reproduce the issue with an
> >> unpatched kernel?  I will work on a kvm-unit-tests testcase.
> > 
> > We don't have a kvm-unit-tests reproducer for this yet, right?
> > 
> > I'm considering trying to write one, but I don't want to
> > duplicate work.
> 
> No, I haven't written one yet.

The reproducer (not a full test case) is quite simple, see patch below.

Now, I've noticed something interesting when running the
reproducer:

If the test_fetch_failure() call happens before we touch
pci-testdev through *mem (like in the patch below), we get an
emulation failure like the one Yanan saw:

  $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
  enabling apic
  paging enabled
  cr0 = 80010011
  cr3 = 45e000
  cr4 = 20
  KVM internal error. Suberror: 1
  emulation failure
  RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
  RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
  R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
  R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
  RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
  SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
  LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
  TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
  GDT=     000000000041100a 0000047f
  IDT=     0000000000000000 00000fff
  CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000500
  Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

but if I call test_fetch_failure() after touching *mem, like this:

    diff --git a/x86/emulator.c b/x86/emulator.c
    index 977ec75..72cb035 100644
    --- a/x86/emulator.c
    +++ b/x86/emulator.c
    @@ -1124,7 +1124,6 @@ int main()
            alt_insn_page = alloc_page();
            insn_ram = vmap(virt_to_phys(insn_page), 4096);
    
    -       test_fetch_failure(mem, alt_insn_page);
    
            // test mov reg, r/m and mov r/m, reg
            t1 = 0x123456789abcdef;
    @@ -1135,6 +1134,8 @@ int main()
                         : "memory");
            report("mov reg, r/m (1)", t2 == 0x123456789abcdef);
    
    +       test_fetch_failure(mem, alt_insn_page);
    +
            test_simplealu(mem);
            test_cmps(mem);
            test_scas(mem);

then I get a KVM_INTERNAL_ERROR_DELIVERY_EV:

    $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.lmXZa46TEA
    enabling apic
    paging enabled
    cr0 = 80010011
    cr3 = 45e000
    cr4 = 20
    PASS: mov reg, r/m (1)
    KVM internal error. Suberror: 3
    extra data[0]: 80000b0e
    extra data[1]: 31
    extra data[2]: 182
    extra data[3]: ff000ff8
    RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
    RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
    R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
    R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
    RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
    ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
    CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
    SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
    DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
    FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
    GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
    LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
    TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
    GDT=     000000000041100a 0000047f
    IDT=     0000000000000000 00000fff
    CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
    DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
    DR6=00000000ffff0ff0 DR7=0000000000000400
    EFER=0000000000000500
    Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
    ^C
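
For readers decoding the log above: a minimal sketch, assuming that `extra data[0]` carries the VMX IDT-vectoring information field (bits 7:0 vector, bits 10:8 event type, bit 11 error-code valid, bit 31 valid, as laid out in the Intel SDM). The struct and function names are hypothetical helpers, not part of KVM or QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an IDT-vectoring information value. */
struct toy_vectoring_info {
	bool valid;		/* bit 31 */
	bool error_code_valid;	/* bit 11 */
	unsigned int type;	/* bits 10:8, 3 means hardware exception */
	unsigned int vector;	/* bits 7:0 */
};

struct toy_vectoring_info toy_decode_vectoring_info(uint32_t raw)
{
	struct toy_vectoring_info v;

	v.vector = raw & 0xff;
	v.type = (raw >> 8) & 0x7;
	v.error_code_valid = ((raw >> 11) & 1) != 0;
	v.valid = ((raw >> 31) & 1) != 0;
	return v;
}
```

Under that assumption, `0x80000b0e` decodes to a valid hardware exception with an error code and vector 14, i.e. a page fault that was being delivered when the exit occurred.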

Also, if I run the reproducer using ept=0, it gets stuck in a
loop, re-entering the same "in (%dx),%al" instruction over and
over again.  trace-cmd report output:

    qemu-system-x86-18185 [001] 1057573.830491: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830494: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830503: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830504: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830505: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830506: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830507: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830508: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830509: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830510: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830511: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830511: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830512: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830513: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830514: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830514: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830515: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830516: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830517: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830518: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830519: kvm_entry:            vcpu 0
    qemu-system-x86-18185 [001] 1057573.830521: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
    qemu-system-x86-18185 [001] 1057573.830522: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
    qemu-system-x86-18185 [001] 1057573.830523: kvm_entry:            vcpu 0
    [...]

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
---
 x86/emulator.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index e6f27cc..977ec75 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -792,9 +792,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
 	extern u8 insn_page[], test_insn[];
 
 	insn_ram = vmap(virt_to_phys(insn_page), 4096);
-	memcpy(alt_insn_page, insn_page, 4096);
-	memcpy(alt_insn_page + (test_insn - insn_page),
-			(void *)(alt_insn->ptr), alt_insn->len);
+	if (alt_insn_page) {
+		memcpy(alt_insn_page, insn_page, 4096);
+		memcpy(alt_insn_page + (test_insn - insn_page),
+				(void *)(alt_insn->ptr), alt_insn->len);
+	}
 	save = inregs;
 
 	/* Load the code TLB with insn_page, but point the page tables at
@@ -805,7 +807,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
 	invlpg(insn_ram);
 	/* Load code TLB */
 	asm volatile("call *%0" : : "r"(insn_ram));
-	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+	if (alt_insn_page) {
+		install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+	} else {
+		install_pte(cr3, 1, insn_ram, PT_USER_MASK, 0);
+	}
 	/* Trap, let hypervisor emulate at alt_insn_page */
 	asm volatile("call *%0": : "r"(insn_ram+1));
 
@@ -1096,6 +1102,11 @@ static void test_illegal_movbe(void)
 	handle_exception(UD_VECTOR, 0);
 }
 
+static void test_fetch_failure(void *mem, void *alt_insn_page)
+{
+	trap_emulator(mem, NULL, NULL);
+}
+
 int main()
 {
 	void *mem;
@@ -1113,6 +1124,8 @@ int main()
 	alt_insn_page = alloc_page();
 	insn_ram = vmap(virt_to_phys(insn_page), 4096);
 
+	test_fetch_failure(mem, alt_insn_page);
+
 	// test mov reg, r/m and mov r/m, reg
 	t1 = 0x123456789abcdef;
 	asm volatile("mov %[t1], (%[mem]) \n\t"
Paolo Bonzini Nov. 29, 2017, 10:47 p.m. UTC | #9
On 29/11/2017 19:42, Eduardo Habkost wrote:
> The reproducer (not a full test case) is quite simple, see patch below.

Great, thanks.  I assume that the patch doesn't fix it?!?

Paolo

> Now, I've noticed something interesting when running the
> reproducer:
> 
> If the test_fetch_failure() call happens before we touch
> pci-testdev through *mem (like in the patch below), we get an
> emulation failure like the one Yanan saw:
> 
>   $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
>   enabling apic
>   paging enabled
>   cr0 = 80010011
>   cr3 = 45e000
>   cr4 = 20
>   KVM internal error. Suberror: 1
>   emulation failure
>   RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
>   RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
>   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>   R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>   RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>   ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>   SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
>   LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>   TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
>   GDT=     000000000041100a 0000047f
>   IDT=     0000000000000000 00000fff
>   CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
>   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>   DR6=00000000ffff0ff0 DR7=0000000000000400
>   EFER=0000000000000500
>   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
Eduardo Habkost Nov. 29, 2017, 11:10 p.m. UTC | #10
On Wed, Nov 29, 2017 at 11:47:14PM +0100, Paolo Bonzini wrote:
> On 29/11/2017 19:42, Eduardo Habkost wrote:
> > The reproducer (not a full test case) is quite simple, see patch below.
> 
> Great, thanks.  I assume that the patch doesn't fix it?!?

I was so convinced that it was impossible for the patch to fix
the problem, that I forgot to test it.  :)

I will test it tomorrow and let you know.


> 
> Paolo
> 
> > Now, I've noticed something interesting when running the
> > reproducer:
> > 
> > If the test_fetch_failure() call happens before we touch
> > pci-testdev through *mem (like in the patch below), we get an
> > emulation failure like the one Yanan saw:
> > 
> >   $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
> >   enabling apic
> >   paging enabled
> >   cr0 = 80010011
> >   cr3 = 45e000
> >   cr4 = 20
> >   KVM internal error. Suberror: 1
> >   emulation failure
> >   RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
> >   RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
> >   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> >   R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> >   RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >   ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> >   CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> >   SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> >   DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> >   FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> >   GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
> >   LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> >   TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
> >   GDT=     000000000041100a 0000047f
> >   IDT=     0000000000000000 00000fff
> >   CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
> >   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> >   DR6=00000000ffff0ff0 DR7=0000000000000400
> >   EFER=0000000000000500
> >   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>
Wanpeng Li Nov. 30, 2017, 9:20 a.m. UTC | #11
2017-11-30 2:42 GMT+08:00 Eduardo Habkost <ehabkost@redhat.com>:
> On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
>> On 29/11/2017 12:44, Eduardo Habkost wrote:
>> > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
>> >> On 13/11/2017 08:15, Wanpeng Li wrote:
>> >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> >>>> Sometimes, a processor might execute an instruction while another
>> >>>> processor is updating the page tables for that instruction's code page,
>> >>>> but before the TLB shootdown completes.  The interesting case happens
>> >>>> if the page is in the TLB.
>> >>>>
>> >>>> In general, the processor will succeed in executing the instruction and
>> >>>> nothing bad happens.  However, what if the instruction is an MMIO access?
>> >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
>> >>>> updated page tables.  If the update side had marked the code page as non
>> >>>> present, the page table walk then will fail and so will x86_decode_insn.
>> >>>>
>> >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
>> >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
>> >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
>> >>>> case for MMIO).  And this in fact happened sometimes when rebooting
>> >>>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
>> >>>> the exception if true is enough to fix the case.
>> >>>
>> >>> I found that the only place that can set ctxt->have_exception is the
>> >>> function x86_emulate_insn(), and x86_decode_insn() will not set
>> >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
>> >>> X86EMUL_PROPAGATE_FAULT.
>> >>
>> >> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
>> >> this patch! :(
>> >>
>> >> Yanan, can you double check that you can reproduce the issue with an
>> >> unpatched kernel?  I will work on a kvm-unit-tests testcase.
>> >
>> > We don't have a kvm-unit-tests reproducer for this yet, right?
>> >
>> > I'm considering trying to write one, but I don't want to
>> > duplicate work.
>>
>> No, I haven't written one yet.
>
> The reproducer (not a full test case) is quite simple, see patch below.

I can also have a look if there is a formal test case. :)

Regards,
Wanpeng Li
Paolo Bonzini Nov. 30, 2017, 4 p.m. UTC | #12
On 30/11/2017 10:20, Wanpeng Li wrote:
>>>> I'm considering trying to write one, but I don't want to
>>>> duplicate work.
>>> No, I haven't written one yet.
>> The reproducer (not a full test case) is quite simple, see patch below.
> I can also have a look if there is a formal test case. :)

FWIW, the patch does not fix it.

Paolo
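
For readers following along, here is a minimal, self-contained model of
why the added check cannot fire, based on Wanpeng Li's observation quoted
earlier in the thread.  Everything named toy_* is hypothetical and only
mirrors the shape of the real code; it is not the kernel implementation.
The model assumes the flag starts out false before decoding and that
only the execution step (not the decode step) ever sets it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the emulator return codes. */
enum toy_rc { TOY_CONTINUE, TOY_PROPAGATE_FAULT };

struct toy_ctxt {
	bool have_exception;	/* models ctxt->have_exception */
	bool code_page_present;	/* models the guest page tables */
};

/* Models the decode step: a failed instruction fetch reports the fault
 * through the return code but never touches have_exception. */
static enum toy_rc toy_decode(const struct toy_ctxt *ctxt)
{
	return ctxt->code_page_present ? TOY_CONTINUE : TOY_PROPAGATE_FAULT;
}

/* Models the caller with the proposed check added.
 * Returns true if an exception would be injected into the guest. */
static bool toy_caller_injects(struct toy_ctxt *ctxt)
{
	ctxt->have_exception = false;	/* assumed reset before decoding */

	if (toy_decode(ctxt) != TOY_CONTINUE) {
		if (ctxt->have_exception)
			return true;	/* the new check: unreachable here */
		return false;		/* falls through to the failure path */
	}
	return false;
}

/* Small self-check: a non-present code page still yields no injection. */
static void toy_self_check(void)
{
	struct toy_ctxt c = { .code_page_present = false };

	assert(!toy_caller_injects(&c));
	assert(!c.have_exception);
}
```

In this model the decode failure is visible only through the return
code, so a check on have_exception alone changes nothing, which is
consistent with the test results reported in this thread.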
Eduardo Habkost Nov. 30, 2017, 4:04 p.m. UTC | #13
On Wed, Nov 29, 2017 at 09:10:47PM -0200, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 11:47:14PM +0100, Paolo Bonzini wrote:
> > On 29/11/2017 19:42, Eduardo Habkost wrote:
> > > The reproducer (not a full test case) is quite simple, see patch below.
> > 
> > Great, thanks.  I assume that the patch doesn't fix it?!?
> 
> I was so convinced that it was impossible for the patch to fix
> the problem, that I forgot to test it.  :)
> 
> I will test it tomorrow and let you know.

I just confirmed that the patch doesn't fix it.

> 
> 
> > 
> > Paolo
> > 
> > > Now, I've noticed something interesting when running the
> > > reproducer:
> > > 
> > > If the test_fetch_failure() call happens before we touch
> > > pci-testdev through *mem (like in the patch below), we get an
> > > emulation failure like the one Yanan saw:
> > > 
> > >   $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
> > >   enabling apic
> > >   paging enabled
> > >   cr0 = 80010011
> > >   cr3 = 45e000
> > >   cr4 = 20
> > >   KVM internal error. Suberror: 1
> > >   emulation failure
> > >   RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
> > >   RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
> > >   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> > >   R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> > >   RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> > >   ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> > >   CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> > >   SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> > >   DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> > >   FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> > >   GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
> > >   LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> > >   TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
> > >   GDT=     000000000041100a 0000047f
> > >   IDT=     0000000000000000 00000fff
> > >   CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
> > >   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> > >   DR6=00000000ffff0ff0 DR7=0000000000000400
> > >   EFER=0000000000000500
> > >   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
> > 
> 
> -- 
> Eduardo
Eduardo Habkost Nov. 30, 2017, 8:33 p.m. UTC | #14
On Wed, Nov 29, 2017 at 04:42:16PM -0200, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
> > On 29/11/2017 12:44, Eduardo Habkost wrote:
> > > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> > >> On 13/11/2017 08:15, Wanpeng Li wrote:
> > >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> > >>>> Sometimes, a processor might execute an instruction while another
> > >>>> processor is updating the page tables for that instruction's code page,
> > >>>> but before the TLB shootdown completes.  The interesting case happens
> > >>>> if the page is in the TLB.
> > >>>>
> > >>>> In general, the processor will succeed in executing the instruction and
> > >>>> nothing bad happens.  However, what if the instruction is an MMIO access?
> > >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
> > >>>> updated page tables.  If the update side had marked the code page as non
> > >>>> present, the page table walk then will fail and so will x86_decode_insn.
> > >>>>
> > >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> > >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> > >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
> > >>>> case for MMIO).  And this in fact happened sometimes when rebooting
> > >>>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> > >>>> the exception if true is enough to fix the case.
> > >>>
> > >>> I found that the only place that can set ctxt->have_exception is the
> > >>> function x86_emulate_insn(), and x86_decode_insn() will not set
> > >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > >>> X86EMUL_PROPAGATE_FAULT.
> > >>
> > >> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> > >> this patch! :(
> > >>
> > >> Yanan, can you double check that you can reproduce the issue with an
> > >> unpatched kernel?  I will work on a kvm-unit-tests testcase.
> > > 
> > > We don't have a kvm-unit-tests reproducer for this yet, right?
> > > 
> > > I'm considering trying to write one, but I don't want to
> > > duplicate work.
> > 
> > No, I haven't written one yet.
> 
> The reproducer (not a full test case) is quite simple, see patch below.
> 
> Now, I've noticed something interesting when running the
> reproducer:

There's something else that makes the bug hard to reproduce: as
soon as I set RSP to a valid address in inregs before calling
trap_emulator(), the bug is not reproducible anymore.

But if I keep RSP=0, I won't be able to validate the bug fix
because I won't be able to configure a working #PF handler.

This alone makes the bug not reproducible anymore:

diff --git a/x86/emulator.c b/x86/emulator.c
index 72cb035..a7e61ff 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -1104,6 +1104,8 @@ static void test_illegal_movbe(void)

 static void test_fetch_failure(void *mem, void *alt_insn_page)
 {
+       void *stack = alloc_page();
+       inregs = (struct regs){ .rsp = (u64)stack+1024 };
        trap_emulator(mem, NULL, NULL);
 }


This is what I see:

When we don't have a stack (inregs.rsp=0),
reexecute_instruction() is preventing the emulation failure from
happening on the I/O instruction VM exits, and KVM keeps entering
the VM in a loop (getting thousands of I/O instruction VM exits)
until we finally get an EPT misconfig VM exit on GVA
0xfffffffffffffff8.

When we set up inregs.rsp, reexecute_instruction() also prevents
the emulation from failing on the I/O instruction VM exits, but
instead of an EPT misconfig VM exit, we get an EPT violation VM exit
after a few thousand iterations, and the page fault is delivered
to the VCPU.

I don't know why KVM loops so many times on I/O instruction VM
exits before finally getting an emulation failure (or finally
delivering a page fault, if a stack is available), but this might
explain why the bug is so hard to reproduce under normal
circumstances.
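
As a reading aid for the "Suberror: 3" dump quoted further down, the
first two "extra data" words can be decoded by hand.  The sketch below
assumes that QEMU prints these values in hexadecimal, that data[0] is
the VMX IDT-vectoring information field, and that data[1] is the VM exit
reason; the thread itself does not state this.  The bit layout is the
one given in the Intel SDM, and the helper names are made up for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the VMX IDT-vectoring information field. */
struct vectoring_info {
	bool valid;		/* bit 31 */
	bool has_error_code;	/* bit 11 */
	unsigned type;		/* bits 10:8, 3 = hardware exception */
	unsigned vector;	/* bits 7:0 */
};

static struct vectoring_info decode_vectoring_info(uint32_t raw)
{
	struct vectoring_info v = {
		.valid = (raw >> 31) & 1,
		.has_error_code = (raw >> 11) & 1,
		.type = (raw >> 8) & 0x7,
		.vector = raw & 0xff,
	};
	return v;
}

/* Checks the values from the dump: extra data[0] = 80000b0e, [1] = 31. */
static void decode_self_check(void)
{
	struct vectoring_info v = decode_vectoring_info(0x80000b0e);

	assert(v.valid && v.has_error_code);
	assert(v.type == 3);	/* hardware exception */
	assert(v.vector == 14);	/* #PF */
	assert(0x31 == 49);	/* exit reason 49: EPT misconfiguration */
}
```

Under those assumptions, the dump describes an EPT misconfig VM exit
taken while a page fault was being delivered, which matches the
RSP=0 case described above.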



> 
> If the test_fetch_failure() call happens before we touch
> pci-testdev through *mem (like in the patch below), we get an
> emulation failure like the one Yanan saw:
> 
>   $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
>   enabling apic
>   paging enabled
>   cr0 = 80010011
>   cr3 = 45e000
>   cr4 = 20
>   KVM internal error. Suberror: 1
>   emulation failure
>   RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
>   RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
>   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>   R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>   RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>   ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>   SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
>   LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>   TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
>   GDT=     000000000041100a 0000047f
>   IDT=     0000000000000000 00000fff
>   CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
>   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>   DR6=00000000ffff0ff0 DR7=0000000000000400
>   EFER=0000000000000500
>   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
> 
> but if I call test_fetch_failure() after touching *mem, like this:
> 
>     diff --git a/x86/emulator.c b/x86/emulator.c
>     index 977ec75..72cb035 100644
>     --- a/x86/emulator.c
>     +++ b/x86/emulator.c
>     @@ -1124,7 +1124,6 @@ int main()
>             alt_insn_page = alloc_page();
>             insn_ram = vmap(virt_to_phys(insn_page), 4096);
>     
>     -       test_fetch_failure(mem, alt_insn_page);
>     
>             // test mov reg, r/m and mov r/m, reg
>             t1 = 0x123456789abcdef;
>     @@ -1135,6 +1134,8 @@ int main()
>                          : "memory");
>             report("mov reg, r/m (1)", t2 == 0x123456789abcdef);
>     
>     +       test_fetch_failure(mem, alt_insn_page);
>     +
>             test_simplealu(mem);
>             test_cmps(mem);
>             test_scas(mem);
> 
> then I get a KVM_INTERNAL_ERROR_DELIVERY_EV:
> 
>     $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.lmXZa46TEA
>     enabling apic
>     paging enabled
>     cr0 = 80010011
>     cr3 = 45e000
>     cr4 = 20
>     PASS: mov reg, r/m (1)
>     KVM internal error. Suberror: 3
>     extra data[0]: 80000b0e
>     extra data[1]: 31
>     extra data[2]: 182
>     extra data[3]: ff000ff8
>     RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
>     RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
>     R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>     R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>     RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>     ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>     SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
>     LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>     TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
>     GDT=     000000000041100a 0000047f
>     IDT=     0000000000000000 00000fff
>     CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
>     DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>     DR6=00000000ffff0ff0 DR7=0000000000000400
>     EFER=0000000000000500
>     Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>     ^C
> 
> Also, if I run the reproducer using ept=0, it gets stuck in a
> loop re-entering the same "in (%dx),%al" instruction over and
> over again.  trace-cmd report output:
> 
>     qemu-system-x86-18185 [001] 1057573.830491: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830494: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830503: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830504: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830505: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830506: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830507: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830508: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830509: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830510: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830511: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830511: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830512: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830513: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830514: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830514: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830515: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830516: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830517: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830518: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830519: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830521: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830522: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830523: kvm_entry:            vcpu 0
>     [...]
> 
> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
> ---
>  x86/emulator.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/x86/emulator.c b/x86/emulator.c
> index e6f27cc..977ec75 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -792,9 +792,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	extern u8 insn_page[], test_insn[];
>  
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
> -	memcpy(alt_insn_page, insn_page, 4096);
> -	memcpy(alt_insn_page + (test_insn - insn_page),
> -			(void *)(alt_insn->ptr), alt_insn->len);
> +	if (alt_insn_page) {
> +		memcpy(alt_insn_page, insn_page, 4096);
> +		memcpy(alt_insn_page + (test_insn - insn_page),
> +				(void *)(alt_insn->ptr), alt_insn->len);
> +	}
>  	save = inregs;
>  
>  	/* Load the code TLB with insn_page, but point the page tables at
> @@ -805,7 +807,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	invlpg(insn_ram);
>  	/* Load code TLB */
>  	asm volatile("call *%0" : : "r"(insn_ram));
> -	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	if (alt_insn_page) {
> +		install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	} else {
> +		install_pte(cr3, 1, insn_ram, PT_USER_MASK, 0);
> +	}
>  	/* Trap, let hypervisor emulate at alt_insn_page */
>  	asm volatile("call *%0": : "r"(insn_ram+1));
>  
> @@ -1096,6 +1102,11 @@ static void test_illegal_movbe(void)
>  	handle_exception(UD_VECTOR, 0);
>  }
>  
> +static void test_fetch_failure(void *mem, void *alt_insn_page)
> +{
> +	trap_emulator(mem, NULL, NULL);
> +}
> +
>  int main()
>  {
>  	void *mem;
> @@ -1113,6 +1124,8 @@ int main()
>  	alt_insn_page = alloc_page();
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
>  
> +	test_fetch_failure(mem, alt_insn_page);
> +
>  	// test mov reg, r/m and mov r/m, reg
>  	t1 = 0x123456789abcdef;
>  	asm volatile("mov %[t1], (%[mem]) \n\t"
> -- 
> 2.13.6
> 
> 
> -- 
> Eduardo

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34c85aa2e2d1..6dbed9022797 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5722,6 +5722,8 @@  int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 			if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
 						emulation_type))
 				return EMULATE_DONE;
+			if (ctxt->have_exception && inject_emulated_exception(vcpu))
+				return EMULATE_DONE;
 			if (emulation_type & EMULTYPE_SKIP)
 				return EMULATE_FAIL;
 			return handle_emulation_failure(vcpu);