kvm.vger.kernel.org archive mirror
From: Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de>
To: Jan Kiszka <jan.kiszka@siemens.com>,
	"Raslan, KarimAllah" <karahmed@amazon.de>,
	"jmattson@google.com" <jmattson@google.com>,
	"liran.alon@oracle.com" <liran.alon@oracle.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>
Subject: Re: KVM_SET_NESTED_STATE not yet stable
Date: Thu, 11 Jul 2019 13:37:17 +0200
Message-ID: <47e8c75d-f39a-89f8-940f-d05a9bc91899@oth-regensburg.de>
In-Reply-To: <cfd86643-dbac-3a69-9faf-03eaa8aee6a1@siemens.com>

Hi all,

On 7/10/19 10:31 PM, Jan Kiszka wrote:
> On 10.07.19 18:05, Jan Kiszka wrote:
>> Hi KarimAllah,
>>
>> On 10.07.19 17:24, Raslan, KarimAllah wrote:
>>> On Mon, 2019-07-08 at 22:39 +0200, Jan Kiszka wrote:
>>>> Hi all,
>>>>
>>>> it seems the "new" KVM_SET_NESTED_STATE interface has some remaining
>>>> robustness issues.
>>>
>>> I would be very interested to learn about any more robustness issues that you 
>>> are seeing.
>>>
>>>> The most urgent one: With the help of latest QEMU
>>>> master that uses this interface, you can easily crash the host. You just
>>>> need to start qemu-system-x86 -enable-kvm in L1 and then hard-reset L1.
>>>> The host CPU that ran this will stall, the system will freeze soon.
>>>
>>> Just to confirm, you start an L2 guest using qemu inside an L1-guest and then 
>>> hard-reset the L1 guest?
>>
>> Exactly.
>>
>>>
>>> Are you running any special workload in L2 or L1 when you reset? Also how 
>>
>> Nope. It is a standard (though rather oldish) userland in L1, just running a
>> more recent kernel 5.2.
>>
>>> exactly are you doing this "hard reset"?
>>
>> system_reset from the monitor or "reset" from QEMU window menu.

I'm not able to reproduce this behaviour on any of my machines
(i7-4810MQ, i7-5600U, Xeon Gold 5118).

>>
>>>
>>> (sorry just tried this in my setup and I did not see any problem but my setup
>>>  is slightly different, so just ruling out obvious stuff).
>>>
>>
>> If it helps, I can share privately a guest image that was built via
>> https://github.com/siemens/jailhouse-images which exposes the reset issue after
>> starting Jailhouse (instead of qemu-system-x86_64 - though that should "work" as
>> well, just not tested yet). It's about 70M packed.
>>
>> Host-wise, 5.2.0 + QEMU master should do. I can also provide you the .config if
>> needed.

I can reproduce and confirm this issue. A system_reset of QEMU after
Jailhouse is enabled leads to the crash listed below, on all machines.

On the Xeon Gold, for example, QEMU reports:

EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000f61
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00a09b00
SS =0000 00000000 0000ffff 00c09300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000680
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 66 89 d8 66 e8 af a1 ff ff 66 83 c4 0c 66 5b 66 5e 66 c3 <ea> 5b
e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00

Kernel:
[ 1868.804515] kvm: vmptrld           (null)/6b8640000000 failed
[ 1868.804568] kvm: vmclear fail:           (null)/6b8640000000

And the host freezes unrecoverably. The hosts run standard distro
kernels, v5.0 or later.

  Ralf

>>
>>>>
>>>> I've also seen a pattern with my Jailhouse test VM where I seem to get
>>>> stuck in a loop between L1 and L2:
>>>>
>>>>  qemu-system-x86-6660  [007]   398.691401: kvm_nested_vmexit:    rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>>  qemu-system-x86-6660  [007]   398.691402: kvm_fpu:              unload
>>>>  qemu-system-x86-6660  [007]   398.691403: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>>>  qemu-system-x86-6660  [007]   398.691440: kvm_fpu:              load
>>>>  qemu-system-x86-6660  [007]   398.691441: kvm_pio:              pio_read at 0x5658 size 4 count 1 val 0x4 
>>>>  qemu-system-x86-6660  [007]   398.691443: kvm_mmu_get_page:     existing sp gfn 3a22e 1/4 q3 direct --x !pge !nxe root 6 sync
>>>>  qemu-system-x86-6660  [007]   398.691444: kvm_entry:            vcpu 3
>>>>  qemu-system-x86-6660  [007]   398.691475: kvm_exit:             reason IO_INSTRUCTION rip 0x7fa9ee5224e4 info 5658000b 0
>>>>  qemu-system-x86-6660  [007]   398.691476: kvm_nested_vmexit:    rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>>  qemu-system-x86-6660  [007]   398.691477: kvm_fpu:              unload
>>>>  qemu-system-x86-6660  [007]   398.691478: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>>>>  qemu-system-x86-6660  [007]   398.691526: kvm_fpu:              load
>>>>  qemu-system-x86-6660  [007]   398.691527: kvm_pio:              pio_read at 0x5658 size 4 count 1 val 0x4 
>>>>  qemu-system-x86-6660  [007]   398.691529: kvm_mmu_get_page:     existing sp gfn 3a22e 1/4 q3 direct --x !pge !nxe root 6 sync
>>>>  qemu-system-x86-6660  [007]   398.691530: kvm_entry:            vcpu 3
>>>>  qemu-system-x86-6660  [007]   398.691533: kvm_exit:             reason IO_INSTRUCTION rip 0x7fa9ee5224e4 info 5658000b 0
>>>>  qemu-system-x86-6660  [007]   398.691534: kvm_nested_vmexit:    rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>>
>>>> These issues disappear when going from ebbfef2f back to 6cfd7639 (both
>>>> with build fixes) in QEMU.
>>>
>>> This is the QEMU that you are using in L0 to launch an L1 guest, right? or are 
>>> you still referring to the QEMU mentioned above?
>>
>> This scenario is similar but still a bit different than the above. Yes, same L0
>> image and host QEMU here (and the traces were taken on the host, obviously), but
>> the workload is now as follows:
>>
>>  - boot L1 Linux
>>  - enable Jailhouse inside L1
>>  - move the mouse over the graphical desktop of L2, i.e. the former L1
>>    Linux (Jailhouse is now L1)
>>  - the L1/L2 guests enter the loop above while trying to read from the
>>    vmmouse port
>>
>> Jan
>>
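
For reference, the info1 value in the kvm_nested_vmexit lines above can
be decoded by hand. The following is a minimal sketch, assuming info1
carries the VMX exit qualification for the IO_INSTRUCTION exit and
using the bit layout from the Intel SDM (size in bits 2:0, direction in
bit 3, string in bit 4, REP in bit 5, port in bits 31:16). The function
name is made up for illustration.

```python
def decode_io_exit_qualification(qual: int) -> dict:
    """Decode a VMX exit qualification for an I/O instruction exit."""
    return {
        "size": (qual & 0x7) + 1,                  # bits 2:0 hold size - 1
        "direction": "in" if qual & 0x8 else "out",  # bit 3: 1 = IN
        "string": bool(qual & 0x10),               # bit 4: INS/OUTS
        "rep": bool(qual & 0x20),                  # bit 5: REP prefix
        "port": (qual >> 16) & 0xFFFF,             # bits 31:16
    }


# Value taken from the trace: "info1 5658000b"
info = decode_io_exit_qualification(0x5658000B)
print(info["direction"], hex(info["port"]), info["size"])
```

For 0x5658000b this gives a 4-byte IN from port 0x5658, which agrees
with the "pio_read at 0x5658 size 4" lines in the same trace.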
> 
> Ralf tried my case on some of his systems as well but he also didn't succeed in
> reproducing. So we compared vmxcap lists because I'm starting to think it's
> feature-related. There are some differences...
> 
> --- vmxcap.i7-5600u	2019-07-10 21:59:05.616547924 +0200
> +++ vmxcap.jan	2019-07-10 21:58:23.135686409 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex: 0xda040000000012
> -  Revision                                 18
> +  Hex: 0xda040000000004
> +  Revision                                 4
>    VMCS size                                1024
>    VMCS restricted to 32 bit addresses      no
>    Dual-monitor support                     yes
> @@ -51,13 +51,13 @@
>    Enable INVPCID                           yes
>    Enable VM functions                      yes
>    VMCS shadowing                           yes
> -  Enable ENCLS exiting                     no
> +  Enable ENCLS exiting                     yes
>    RDSEED exiting                           yes
> -  Enable PML                               no
> +  Enable PML                               yes
>    EPT-violation #VE                        yes
> -  Conceal non-root operation from PT       no
> -  Enable XSAVES/XRSTORS                    no
> -  Mode-based execute control (XS/XU)       no
> +  Conceal non-root operation from PT       yes
> +  Enable XSAVES/XRSTORS                    yes
> +  Mode-based execute control (XS/XU)       yes
>    TSC scaling                              no
>  VM-Exit controls
>    Save debug controls                      default
> @@ -69,8 +69,8 @@
>    Save IA32_EFER                           yes
>    Load IA32_EFER                           yes
>    Save VMX-preemption timer value          yes
> -  Clear IA32_BNDCFGS                       no
> -  Conceal VM exits from PT                 no
> +  Clear IA32_BNDCFGS                       yes
> +  Conceal VM exits from PT                 yes
>  VM-Entry controls
>    Load debug controls                      default
>    IA-32e mode guest                        yes
> @@ -79,11 +79,11 @@
>    Load IA32_PERF_GLOBAL_CTRL               yes
>    Load IA32_PAT                            yes
>    Load IA32_EFER                           yes
> -  Load IA32_BNDCFGS                        no
> -  Conceal VM entries from PT               no
> +  Load IA32_BNDCFGS                        yes
> +  Conceal VM entries from PT               yes
>  Miscellaneous data
> -  Hex: 0x300481e5
> -  VMX-preemption timer scale (log2)        5
> +  Hex: 0x7004c1e7
> +  VMX-preemption timer scale (log2)        7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                       yes
>    Shutdown activity state                  yes
> @@ -93,10 +93,10 @@
>    MSR-load/store count recommendation      0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1  yes
>    VMWRITE to VM-exit information fields    yes
> -  Inject event with insn length=0          no
> +  Inject event with insn length=0          yes
>    MSEG revision identifier                 0
>  VPID and EPT capabilities
> -  Hex: 0xf0106334141
> +  Hex: 0xf0106734141
>    Execute-only EPT translations            yes
>    Page-walk length 4                       yes
>    Paging-structure memory type UC          yes
> 
> And another machine that does not crash:
> 
> --- vmxcaps.e5-2683v4	2019-07-10 22:21:28.620329384 +0200
> +++ vmxcap.jan	2019-07-10 21:58:23.135686409 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex: 0xda040000000012
> -  Revision                                 18
> +  Hex: 0xda040000000004
> +  Revision                                 4
>    VMCS size                                1024
>    VMCS restricted to 32 bit addresses      no
>    Dual-monitor support                     yes
> @@ -12,7 +12,7 @@
>    NMI exiting                              yes
>    Virtual NMIs                             yes
>    Activate VMX-preemption timer            yes
> -  Process posted interrupts                yes
> +  Process posted interrupts                no
>  primary processor-based controls
>    Interrupt window exiting                 yes
>    Use TSC offsetting                       yes
> @@ -44,20 +44,20 @@
>    Enable VPID                              yes
>    WBINVD exiting                           yes
>    Unrestricted guest                       yes
> -  APIC register emulation                  yes
> -  Virtual interrupt delivery               yes
> +  APIC register emulation                  no
> +  Virtual interrupt delivery               no
>    PAUSE-loop exiting                       yes
>    RDRAND exiting                           yes
>    Enable INVPCID                           yes
>    Enable VM functions                      yes
>    VMCS shadowing                           yes
> -  Enable ENCLS exiting                     no
> +  Enable ENCLS exiting                     yes
>    RDSEED exiting                           yes
>    Enable PML                               yes
>    EPT-violation #VE                        yes
> -  Conceal non-root operation from PT       no
> -  Enable XSAVES/XRSTORS                    no
> -  Mode-based execute control (XS/XU)       no
> +  Conceal non-root operation from PT       yes
> +  Enable XSAVES/XRSTORS                    yes
> +  Mode-based execute control (XS/XU)       yes
>    TSC scaling                              no
>  VM-Exit controls
>    Save debug controls                      default
> @@ -69,8 +69,8 @@
>    Save IA32_EFER                           yes
>    Load IA32_EFER                           yes
>    Save VMX-preemption timer value          yes
> -  Clear IA32_BNDCFGS                       no
> -  Conceal VM exits from PT                 no
> +  Clear IA32_BNDCFGS                       yes
> +  Conceal VM exits from PT                 yes
>  VM-Entry controls
>    Load debug controls                      default
>    IA-32e mode guest                        yes
> @@ -79,11 +79,11 @@
>    Load IA32_PERF_GLOBAL_CTRL               yes
>    Load IA32_PAT                            yes
>    Load IA32_EFER                           yes
> -  Load IA32_BNDCFGS                        no
> -  Conceal VM entries from PT               no
> +  Load IA32_BNDCFGS                        yes
> +  Conceal VM entries from PT               yes
>  Miscellaneous data
> -  Hex: 0x300481e5
> -  VMX-preemption timer scale (log2)        5
> +  Hex: 0x7004c1e7
> +  VMX-preemption timer scale (log2)        7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                       yes
>    Shutdown activity state                  yes
> @@ -93,10 +93,10 @@
>    MSR-load/store count recommendation      0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1  yes
>    VMWRITE to VM-exit information fields    yes
> -  Inject event with insn length=0          no
> +  Inject event with insn length=0          yes
>    MSEG revision identifier                 0
>  VPID and EPT capabilities
> -  Hex: 0xf0106334141
> +  Hex: 0xf0106734141
>    Execute-only EPT translations            yes
>    Page-walk length 4                       yes
>    Paging-structure memory type UC          yes
> 
> And on a Xeon D-1540, I'm not seeing a crash but a kvm entry failure when
> resetting L1 while running Jailhouse:
> 
> KVM: entry failed, hardware error 0x7
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000f61
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=0000fff0 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 00000000 0000ffff 00009300
> CS =f000 ffff0000 0000ffff 00a09b00
> SS =0000 00000000 0000ffff 00c09300
> DS =0000 00000000 0000ffff 00009300
> FS =0000 00000000 0000ffff 00009300
> GS =0000 00000000 0000ffff 00009300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 00000000 0000ffff 00008b00
> GDT=     00000000 0000ffff
> IDT=     00000000 0000ffff
> CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000680
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=00 66 89 d8 66 e8 af a1 ff ff 66 83 c4 0c 66 5b 66 5e 66 c3 <ea> 5b e0 00
> f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Here is the vmxcap diff:
> 
> --- xeon-d	2019-07-10 22:29:56.735374032 +0200
> +++ i7-8850H	2019-07-10 22:29:31.747467248 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex: 0xda040000000012
> -  Revision                                 18
> +  Hex: 0xda040000000004
> +  Revision                                 4
>    VMCS size                                1024
>    VMCS restricted to 32 bit addresses      no
>    Dual-monitor support                     yes
> @@ -12,7 +12,7 @@ pin-based controls
>    NMI exiting                              yes
>    Virtual NMIs                             yes
>    Activate VMX-preemption timer            yes
> -  Process posted interrupts                yes
> +  Process posted interrupts                no
>  primary processor-based controls
>    Interrupt window exiting                 yes
>    Use TSC offsetting                       yes
> @@ -44,20 +44,20 @@ secondary processor-based controls
>    Enable VPID                              yes
>    WBINVD exiting                           yes
>    Unrestricted guest                       yes
> -  APIC register emulation                  yes
> -  Virtual interrupt delivery               yes
> +  APIC register emulation                  no
> +  Virtual interrupt delivery               no
>    PAUSE-loop exiting                       yes
>    RDRAND exiting                           yes
>    Enable INVPCID                           yes
>    Enable VM functions                      yes
>    VMCS shadowing                           yes
> -  Enable ENCLS exiting                     no
> +  Enable ENCLS exiting                     yes
>    RDSEED exiting                           yes
>    Enable PML                               yes
>    EPT-violation #VE                        yes
> -  Conceal non-root operation from PT       no
> -  Enable XSAVES/XRSTORS                    no
> -  Mode-based execute control (XS/XU)       no
> +  Conceal non-root operation from PT       yes
> +  Enable XSAVES/XRSTORS                    yes
> +  Mode-based execute control (XS/XU)       yes
>    TSC scaling                              no
>  VM-Exit controls
>    Save debug controls                      default
> @@ -69,8 +69,8 @@ VM-Exit controls
>    Save IA32_EFER                           yes
>    Load IA32_EFER                           yes
>    Save VMX-preemption timer value          yes
> -  Clear IA32_BNDCFGS                       no
> -  Conceal VM exits from PT                 no
> +  Clear IA32_BNDCFGS                       yes
> +  Conceal VM exits from PT                 yes
>  VM-Entry controls
>    Load debug controls                      default
>    IA-32e mode guest                        yes
> @@ -79,11 +79,11 @@ VM-Entry controls
>    Load IA32_PERF_GLOBAL_CTRL               yes
>    Load IA32_PAT                            yes
>    Load IA32_EFER                           yes
> -  Load IA32_BNDCFGS                        no
> -  Conceal VM entries from PT               no
> +  Load IA32_BNDCFGS                        yes
> +  Conceal VM entries from PT               yes
>  Miscellaneous data
> -  Hex: 0x300481e5
> -  VMX-preemption timer scale (log2)        5
> +  Hex: 0x7004c1e7
> +  VMX-preemption timer scale (log2)        7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                       yes
>    Shutdown activity state                  yes
> @@ -93,10 +93,10 @@ Miscellaneous data
>    MSR-load/store count recommendation      0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1  yes
>    VMWRITE to VM-exit information fields    yes
> -  Inject event with insn length=0          no
> +  Inject event with insn length=0          yes
>    MSEG revision identifier                 0
>  VPID and EPT capabilities
> -  Hex: 0xf0106334141
> +  Hex: 0xf0106734141
>    Execute-only EPT translations            yes
>    Page-walk length 4                       yes
>    Paging-structure memory type UC          yes
> 
> Maybe the KVM code does not take the latest VMX features into account when
> importing a userspace nested state?
> 
> Jan
> 
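
The "Hex:" values in the vmxcap diffs above are raw VMX capability MSR
contents, so the differences can also be checked directly. A minimal
sketch, assuming the IA32_VMX_BASIC and IA32_VMX_MISC field layout from
the Intel SDM (Appendix A); the helper names are made up for
illustration and only a few fields are decoded.

```python
def decode_vmx_basic(msr: int) -> dict:
    """Decode a few fields of IA32_VMX_BASIC."""
    return {
        "revision": msr & 0x7FFFFFFF,              # bits 30:0
        "vmcs_size": (msr >> 32) & 0x1FFF,         # bits 44:32
        "addr_32bit_only": bool((msr >> 48) & 1),  # bit 48
        "dual_monitor": bool((msr >> 49) & 1),     # bit 49
    }


def decode_vmx_misc(msr: int) -> dict:
    """Decode a few fields of IA32_VMX_MISC."""
    return {
        "preemption_timer_scale": msr & 0x1F,        # bits 4:0
        "inject_insn_len_0": bool((msr >> 30) & 1),  # bit 30
    }


def differing_bits(a: int, b: int) -> list:
    """Return the bit positions in which two MSR values differ."""
    x = a ^ b
    return [i for i in range(x.bit_length()) if (x >> i) & 1]


print(decode_vmx_basic(0xDA040000000012))             # revision 18, size 1024
print(decode_vmx_misc(0x7004C1E7))                    # scale 7, insn length=0 allowed
print(differing_bits(0xF0106334141, 0xF0106734141))   # [22]
```

The two "VPID and EPT capabilities" values differ only in bit 22, and
the decoded revision, VMCS size and timer scale match the text fields
that vmxcap prints.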

