* Guest reboot issues since QEMU 6.0 and Linux 5.11
@ 2022-07-21 12:49 Fabian Ebner
  2022-07-21 15:51 ` Maxim Levitsky
  2022-07-28 10:13 ` Yan Vugenfirer
  0 siblings, 2 replies; 5+ messages in thread
From: Fabian Ebner @ 2022-07-21 12:49 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: Thomas Lamprecht, Mira Limbeck

Hi,
for about half a year, we've been getting user reports about guest
reboot issues with KVM/QEMU[0].

The most common scenario is a Windows Server VM (2012R2/2016/2019,
UEFI/OVMF and SeaBIOS) getting stuck at the screen with the Windows
logo and the spinning circles after a reboot was triggered from within
the guest. Quitting the kvm process and booting with a fresh instance
works. The issue seems to become more likely the longer the kvm
instance runs.

We did not get such reports while we were providing Linux 5.4 and QEMU
5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.

I'm just wondering if anybody has seen this issue before or might have a
hunch what it's about? Any tips on what to look out for when debugging
are also greatly appreciated!

We do have debug access to a user's test VM and the VM state was saved
before a problematic reboot, but I can't modify the host system there.
AFAICT QEMU just executes guest code as usual, but I'm really not sure
what to look out for.

That VM has CPU type host, and a colleague did have a similar enough CPU
to load the VM state, but for him, the reboot went through normally. On
the user's system, it triggers consistently after loading the VM state
and rebooting.

So unfortunately, we haven't managed to reproduce the issue locally yet.
With two other images provided by users, we ran into a boot loop, where
QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
and then it repeats. It's not clear if the issues are related.
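
To make "a few KVM_RUNs before KVM_EXIT_SHUTDOWN" measurable, a minimal
sketch along these lines could summarize the loop. It assumes the exit
reasons have already been collected somehow (for example via tracing);
the input list and the function name are hypothetical, and only the
numeric value of KVM_EXIT_SHUTDOWN is taken from <linux/kvm.h>.

```python
# Value of KVM_EXIT_SHUTDOWN in <linux/kvm.h>.
KVM_EXIT_SHUTDOWN = 8


def runs_per_reset(exit_reasons):
    """Split a flat sequence of KVM exit reasons into reset cycles.

    Returns, for each cycle that ended in KVM_EXIT_SHUTDOWN, the number
    of exits seen in that cycle (the shutdown exit itself is counted).
    Trailing exits that have not ended in a shutdown are ignored.
    """
    cycles = []
    count = 0
    for reason in exit_reasons:
        count += 1
        if reason == KVM_EXIT_SHUTDOWN:
            cycles.append(count)
            count = 0
    return cycles
```

A consistently small and identical count per cycle would suggest the
guest faults at the same early point after every reset.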

There are also a few reports about non-Windows VMs, mostly Ubuntu 20.04
with UEFI/OVMF, but again, it's not clear if the issues are related.

[0]: https://forum.proxmox.com/threads/100744/
(the forum thread is a bit chaotic unfortunately).

Best Regards,
Fabi



* Re: Guest reboot issues since QEMU 6.0 and Linux 5.11
  2022-07-21 12:49 Guest reboot issues since QEMU 6.0 and Linux 5.11 Fabian Ebner
@ 2022-07-21 15:51 ` Maxim Levitsky
  2022-07-22 12:28   ` Fiona Ebner
  2022-07-28 10:13 ` Yan Vugenfirer
  1 sibling, 1 reply; 5+ messages in thread
From: Maxim Levitsky @ 2022-07-21 15:51 UTC (permalink / raw)
  To: Fabian Ebner, kvm, qemu-devel; +Cc: Thomas Lamprecht, Mira Limbeck

On Thu, 2022-07-21 at 14:49 +0200, Fabian Ebner wrote:
> Hi,
> since about half a year ago, we're getting user reports about guest
> reboot issues with KVM/QEMU[0].
> 
> The most common scenario is a Windows Server VM (2012R2/2016/2019,
> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
> logo and the spinning circles after a reboot was triggered from within
> the guest. Quitting the kvm process and booting with a fresh instance
> works. The issue seems to become more likely, the longer the kvm
> instance runs.
> 
> We did not get such reports while we were providing Linux 5.4 and QEMU
> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
> 
> I'm just wondering if anybody has seen this issue before or might have a
> hunch what it's about? Any tips on what to look out for when debugging
> are also greatly appreciated!
> 
> We do have debug access to a user's test VM and the VM state was saved
> before a problematic reboot, but I can't modify the host system there.
> AFAICT QEMU just executes guest code as usual, but I'm really not sure
> what to look out for.
> 
> That VM has CPU type host, and a colleague did have a similar enough CPU
> to load the VM state, but for him, the reboot went through normally. On
> the user's system, it triggers consistently after loading the VM state
> and rebooting.
> 
> So unfortunately, we didn't manage to reproduce the issue locally yet.
> With two other images provided by users, we ran into a boot loop, where
> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
> and then it repeats. It's not clear if the issues are related.


Does the guest have Hyper-V enabled in it (that is, nested virtualization)?

Intel or AMD?

Does the VM use secure boot / SMM?

Best regards,
	Maxim Levitsky


* Re: Guest reboot issues since QEMU 6.0 and Linux 5.11
  2022-07-21 15:51 ` Maxim Levitsky
@ 2022-07-22 12:28   ` Fiona Ebner
  0 siblings, 0 replies; 5+ messages in thread
From: Fiona Ebner @ 2022-07-22 12:28 UTC (permalink / raw)
  To: Maxim Levitsky, kvm, qemu-devel; +Cc: Thomas Lamprecht, Mira Limbeck

On 21.07.22 at 17:51, Maxim Levitsky wrote:
> On Thu, 2022-07-21 at 14:49 +0200, Fabian Ebner wrote:
>> Hi,
>> since about half a year ago, we're getting user reports about guest
>> reboot issues with KVM/QEMU[0].
>>
>> The most common scenario is a Windows Server VM (2012R2/2016/2019,
>> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
>> logo and the spinning circles after a reboot was triggered from within
>> the guest. Quitting the kvm process and booting with a fresh instance
>> works. The issue seems to become more likely, the longer the kvm
>> instance runs.
>>
>> We did not get such reports while we were providing Linux 5.4 and QEMU
>> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
>>
>> I'm just wondering if anybody has seen this issue before or might have a
>> hunch what it's about? Any tips on what to look out for when debugging
>> are also greatly appreciated!
>>
>> We do have debug access to a user's test VM and the VM state was saved
>> before a problematic reboot, but I can't modify the host system there.
>> AFAICT QEMU just executes guest code as usual, but I'm really not sure
>> what to look out for.
>>
>> That VM has CPU type host, and a colleague did have a similar enough CPU
>> to load the VM state, but for him, the reboot went through normally. On
>> the user's system, it triggers consistently after loading the VM state
>> and rebooting.
>>
>> So unfortunately, we didn't manage to reproduce the issue locally yet.
>> With two other images provided by users, we ran into a boot loop, where
>> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
>> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
>> and then it repeats. It's not clear if the issues are related.
> 
> 
> Does the guest have Hyper-V enabled in it (that is, nested virtualization)?
> 

For all three machines described above,
Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V
indicates that Hyper-V is disabled.

> Intel or AMD?
> 

We do have reports for both Intel and AMD.

> Does the VM use secure boot / SMM?
> 

The customer VM which can reliably trigger the issue after loading the
state and rebooting uses SeaBIOS. For the other two VMs,
Confirm-SecureBootUEFI
returns "False".

SMM might be a lead! We did disable SMM in the past, because apparently
there were problems with it (I didn't dig out which ones; that was before
I worked here), and the timing of enabling it matches when the reports
started coming in. I guess (some) guest OSes don't expect it to be
suddenly turned on?

However, there is a report of a user with two clusters with QEMU 5.2,
one with kernel 5.4 without the issue and one with kernel 5.11 with the
issue (Windows VM with spinning circles). So that's confusing :/


We do use some additional options if the OS type is "Windows" in our
high-level configuration, including Hyper-V enlightenments:

> -cpu 'host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt'
> -no-hpet
> -rtc 'driftfix=slew,base=localtime'
> -global 'kvm-pit.lost_tick_policy=discard'

But one user reported running into the issue even with OS type "other",
i.e. when the above options are not present and CPU flags should be just
'+kvm_pv_eoi,+kvm_pv_unhalt'. There are also reports with CPU types
other than 'host', including 'kvm64' (where we automatically set the
flags +lahf_lm,+sep).
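
Since the affected setups differ in their -cpu flags, a small helper to
diff two -cpu values may help when comparing reports. This is only a
sketch: the function names are made up for illustration, and it assumes
a bare property name or a '+' prefix means "on" and a '-' prefix means
"off".

```python
def parse_cpu_option(value):
    """Split a QEMU -cpu value into (model, {property: value})."""
    model, *props = value.split(",")
    parsed = {}
    for prop in props:
        if not prop:
            continue
        if prop.startswith("+"):
            parsed[prop[1:]] = "on"
        elif prop.startswith("-"):
            parsed[prop[1:]] = "off"
        elif "=" in prop:
            key, val = prop.split("=", 1)
            parsed[key] = val
        else:
            # A bare property name is treated as enabled.
            parsed[prop] = "on"
    return model, parsed


def diff_cpu_options(a, b):
    """Return {property: (value_in_a, value_in_b)} for differing properties."""
    _, props_a = parse_cpu_option(a)
    _, props_b = parse_cpu_option(b)
    return {
        key: (props_a.get(key), props_b.get(key))
        for key in sorted(set(props_a) | set(props_b))
        if props_a.get(key) != props_b.get(key)
    }
```

For example, diffing the "Windows" option string above against
'kvm64,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep' lists the hv_* flags as
present only on one side and lahf_lm/sep only on the other.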


Thank you and Best Regards,
Fiona

P.S. Please don't mind the (from your perspective sudden) name change.
I'm still the same person and don't intend to change it again :)




* Re: Guest reboot issues since QEMU 6.0 and Linux 5.11
  2022-07-21 12:49 Guest reboot issues since QEMU 6.0 and Linux 5.11 Fabian Ebner
  2022-07-21 15:51 ` Maxim Levitsky
@ 2022-07-28 10:13 ` Yan Vugenfirer
  2022-08-02 11:57   ` Fiona Ebner
  1 sibling, 1 reply; 5+ messages in thread
From: Yan Vugenfirer @ 2022-07-28 10:13 UTC (permalink / raw)
  To: Fabian Ebner; +Cc: kvm, QEMU Developers, Thomas Lamprecht, Mira Limbeck

Hi Fabian,

Can you save a dump file from the QEMU monitor using dump-guest-memory, or with virsh dump?
Then you can use elf2dmp (which is compiled with QEMU and found in the “contrib” folder) to convert the dump file to WinDbg format and examine the stack.
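
If the monitor is only reachable via QMP, the same steps could be
sketched roughly as follows. This only builds the messages and the
elf2dmp command line; sending them over the QMP socket is left out. The
helper names and file paths are illustrative, and "paging": false (a
dump of guest-physical memory) is my assumption about what elf2dmp
expects.

```python
import json


def qmp_dump_commands(dump_path):
    """Build the QMP messages for an ELF guest memory dump."""
    return [
        # Capability negotiation must precede any other QMP command.
        {"execute": "qmp_capabilities"},
        {
            "execute": "dump-guest-memory",
            "arguments": {"paging": False, "protocol": "file:" + dump_path},
        },
    ]


def to_wire(commands):
    """Serialize the commands as newline-delimited JSON for the socket."""
    return "".join(json.dumps(cmd) + "\n" for cmd in commands)


def elf2dmp_argv(dump_path, dmp_path):
    """Command line converting the ELF dump to a WinDbg-readable file."""
    return ["elf2dmp", dump_path, dmp_path]
```

The resulting .dmp file can then be opened in WinDbg to examine the
stack of each CPU.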


Best regards,
Yan.





* Re: Guest reboot issues since QEMU 6.0 and Linux 5.11
  2022-07-28 10:13 ` Yan Vugenfirer
@ 2022-08-02 11:57   ` Fiona Ebner
  0 siblings, 0 replies; 5+ messages in thread
From: Fiona Ebner @ 2022-08-02 11:57 UTC (permalink / raw)
  To: Yan Vugenfirer; +Cc: kvm, QEMU Developers, Thomas Lamprecht, Mira Limbeck

On 28.07.22 at 12:13, Yan Vugenfirer wrote:
> Hi Fabian,
> 
> Can you save a dump file from the QEMU monitor using dump-guest-memory, or with virsh dump?
> Then you can use elf2dmp (which is compiled with QEMU and found in the “contrib” folder) to convert the dump file to WinDbg format and examine the stack.
> 

Hi Yan,
thank you for the suggestion!

So for the first of the two VMs in the KVM_EXIT_SHUTDOWN/qemu_system_reset loop, I get

> 2 CPU states has been found
> CPU #0 CR3 is 0x0000000000800000
> DirectoryTableBase = 0x000fffffffd08000 has been found from CPU #0 as interrupt handling CR3
> [1]    4169758 segmentation fault (core dumped)  elf2dmp memdump.elf memdump.dmp

I tried twice more, hoping for better timing, but the results were the
same (I haven't looked into why it segfaults yet). For the second VM,
there is no segfault, but still an error upon conversion:

> 2 CPU states has been found
> CPU #0 CR3 is 0x0000000000800000
> DirectoryTableBase = 0x0000000000000000 has been found from CPU #0 as interrupt handling CR3
> Failed to find paging base


For the VM with the spinning circles, the dump was converted
successfully at least, but I don't have any experience with WinDbg and
nothing interesting pops out to me:

> Microsoft (R) Windows Debugger Version 10.0.22621.1 AMD64
> Copyright (c) Microsoft Corporation. All rights reserved.
> 
> 
> Loading Dump File [F:\win-reboot-dump\memdump-circles.dmp]
> Kernel Complete Dump File: Full address space is available
> 
> Comment: 'Hello from elf2dmp!'
> Symbol search path is: srv*
> Executable search path is: 
> Windows 8.1 Kernel Version 9600 MP (2 procs) Free x64
> Product: Server, suite: TerminalServer SingleUserTS
> Edition build lab: 9600.19913.amd64fre.winblue_ltsb_escrow.201207-1920
> Machine Name:
> Kernel base = 0xfffff802`05073000 PsLoadedModuleList = 0xfffff802`053385d0
> System Uptime: 0 days 0:00:52.919
> Loading Kernel Symbols
> ...............................................................
> .....................................
> Loading User Symbols
> 
> Loading unloaded module list
> ..
> Unknown exception - code 00000000 (first/second chance not available)
> For analysis of this file, run !analyze -v
> 0: kd> ~0
> 0: kd> kb
>  # RetAddr               : Args to Child                                                           : Call Site
> 00 fffff802`0526a0ad     : ffffe002`00519650 ffffe002`00519410 00000000`00000000 fffff802`05354180 : hal!HalProcessorIdle+0xf
> 01 fffff802`05168a50     : fffff802`05354180 fffff802`066a2300 00000000`00000000 ffffe002`00519410 : nt!PpmIdleDefaultExecute+0x1d
> 02 fffff802`050dd186     : fffff802`05354180 fffff802`066a23cc fffff802`066a23d0 fffff802`066a23d8 : nt!PpmIdleExecuteTransition+0x400
> 03 fffff802`051b71ac     : fffff802`05354180 fffff802`05354180 fffff802`053bba00 00000000`00000000 : nt!PoIdle+0x2f6
> 04 00000000`00000000     : fffff802`066a3000 fffff802`0669c000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x2c
> 0: kd> r
> rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000086
> rdx=0000000000000000 rsi=00000000ffffffff rdi=ffffe00200519410
> rip=fffff8020501e81f rsp=fffff802066a2258 rbp=fffff802066a2339
>  r8=00000000ffffffff  r9=0000000000000000 r10=0000000000000002
> r11=0000000000000001 r12=0000000000000000 r13=0000000000000001
> r14=000000001f8de167 r15=0000000000000000
> iopl=0         nv up ei ng nz na pe nc
> cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282
> hal!HalProcessorIdle+0xf:
> fffff802`0501e81f c3              ret
> 0: kd> ~1
> 1: kd> kb
>  # RetAddr               : Args to Child                                                           : Call Site
> 00 fffff802`0526a0ad     : ffffe002`00521b20 ffffe002`005218e0 00000000`00000000 ffffd000`203da180 : hal!HalProcessorIdle+0xf
> 01 fffff802`05168a50     : ffffd000`203da180 ffffd000`203f8300 00000000`00000000 0000018e`b7f04213 : nt!PpmIdleDefaultExecute+0x1d
> 02 fffff802`050dd186     : ffffd000`203da180 ffffd000`203f83cc ffffd000`203f83d0 ffffd000`203f83d8 : nt!PpmIdleExecuteTransition+0x400
> 03 fffff802`051b71ac     : ffffd000`203da180 ffffd000`203da180 ffffd000`203ea300 00000000`00000000 : nt!PoIdle+0x2f6
> 04 00000000`00000000     : ffffd000`203f9000 ffffd000`203f2000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x2c
> 1: kd> r
> rax=0000000000000020 rbx=0000000000000000 rcx=0000000000000086
> rdx=0000000000000000 rsi=00000000ffffffff rdi=ffffe002005218e0
> rip=fffff8020501e81f rsp=ffffd000203f8258 rbp=ffffd000203f8339
>  r8=00000000ffffffff  r9=0000000000000000 r10=0000000000000002
> r11=0000000000000001 r12=0000000000000000 r13=0000000000000001
> r14=000128d624996edd r15=0000000000000000
> iopl=0         nv up ei ng nz na pe nc
> cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282
> hal!HalProcessorIdle+0xf:
> fffff802`0501e81f c3              ret

Is there anything I should be looking at in particular?

I took a second dump about 40 minutes later, but it essentially looks
the same.

Best regards,
Fiona



