qemu-devel.nongnu.org archive mirror
* Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-20 19:01 UTC
  To: QEMU Devel Mailing List

Hi,

I'm seeing live migration fail when the VM being migrated is running a
nested VM. I'm using the latest Linux kernel (v5.3) and QEMU (v4.1.0).
I also tried kernel v5.2, but the result was the same. The L1 and L2
VMs both run kernel v4.18, but I don't think that matters.

The symptom is that the L2 VM's kernel crashes in different places
after migration, but the call stacks are mostly related to memory
management, as in [1] and [2]. The crash happens almost every time.
While the L2 VM panics, the L1 VM runs fine after the migration. Both
the L1 and L2 VMs were idle during migration.

I found a few clues about this issue.
1) It happens when L1 has a relatively large amount of memory (24G),
but not with a smaller size (3G).

2) Migrating a stopped VM works: when I first ran the "stop" command
in the QEMU monitor for L1 and then migrated, migration always worked.
It also worked when I stopped only the L2 VM and kept L1 live during
the migration.

With those two clues, my guess is that some pages dirtied by L2 are
not transferred to the destination correctly, but I'm not really sure.

3) It happens on an Intel(R) Xeon(R) Silver 4114 CPU, but not on an
Intel(R) Xeon(R) CPU E5-2630 v3.

This confuses me because I thought migrating nested state doesn't
depend on the underlying hardware. In any case, L1-only migration with
the large memory size (24G) works on both CPUs without any problem.
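
As a rough illustration of clue 2, the stop-then-migrate sequence can
be written as a small shell helper that prints the QEMU monitor (HMP)
commands in order. This is only a sketch: the function name and the
destination URI tcp:dest-host:4445 are hypothetical, and the
destination QEMU is assumed to have been started with a matching
-incoming option.

```shell
#!/bin/sh
# Print the HMP (human monitor) commands for migrating a stopped guest:
# pause the vCPUs, start the migration in the background, query status.
# The destination URI is a placeholder supplied by the caller.
stopped_migration_cmds() {
    dest_uri="$1"
    printf '%s\n' "stop" "migrate -d $dest_uri" "info migrate"
}

# Hypothetical destination; replace with the real host and port.
stopped_migration_cmds "tcp:dest-host:4445"
```

Typed at the (qemu) prompt for L1, these commands pause the guest
before any memory is sent, so the guest cannot dirty pages while the
migration is in flight.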

I would appreciate any comments/suggestions to fix this problem.

Thanks,
Jintack


[1]https://paste.ubuntu.com/p/XGDKH45yt4/
[2]https://paste.ubuntu.com/p/CpbVTXJCyc/



* Re: Migration failure when running nested VMs
From: Dr. David Alan Gilbert @ 2019-09-23 10:42 UTC
  To: Jintack Lim, pbonzini; +Cc: QEMU Devel Mailing List

* Jintack Lim (incredible.tack@gmail.com) wrote:
> Hi,

Copying in Paolo, since he recently did work to fix nested migration.
It was expected to be broken until pretty recently, but QEMU 4.1.0 on
a 5.3 kernel is pretty new, so I'd have expected it to work.

> I'm seeing VM live migration failure when a VM is running a nested VM.
> I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
> v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
> v4.18, but I don't think that matters.
> 
> The symptom is that L2 VM kernel crashes in different places after
> migration but the call stack is mostly related to memory management
> like [1] and [2]. The kernel crash happens almost all the time. While
> L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
> and L2 VM were doing nothing during migration.
> 
> I found a few clues about this issue.
> 1) It happens with a relatively large memory for L1 (24G), but it does
> not with a smaller size (3G).
> 
> 2) Dead migration worked; when I ran "stop" command in the qemu
> monitor for L1 first and did migration, migration worked always. It
> also worked when I only stopped L2 VM and kept L1 live during the
> migration.
> 
> With those two clues, I guess maybe some dirty pages made by L2 are
> not transferred to the destination correctly, but I'm not really sure.
> 
> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
> 
> This makes me confused because I thought migrating nested state
> doesn't depend on the underlying hardware.. Anyways, L1-only migration
> with the large memory size (24G) works on both CPUs without any
> problem.
> 
> I would appreciate any comments/suggestions to fix this problem.

Can you share the qemu command lines you're using for both L1 and L2,
please?
Are there any dmesg entries around the time of the migration on either
the hosts or the L1 VMs?
What guest OS are you running in L1 and L2?

Dave

> Thanks,
> Jintack
> 
> 
> [1]https://paste.ubuntu.com/p/XGDKH45yt4/
> [2]https://paste.ubuntu.com/p/CpbVTXJCyc/
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: Migration failure when running nested VMs
From: Paolo Bonzini @ 2019-09-23 11:48 UTC
  To: Dr. David Alan Gilbert, Jintack Lim; +Cc: QEMU Devel Mailing List

On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
> 
> With those two clues, I guess maybe some dirty pages made by L2 are
> not transferred to the destination correctly, but I'm not really sure.
> 
> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.

Hmm, try disabling pml (kvm_intel.pml=0).  This would be the main
difference, memory-management wise, between those two machines.

Paolo
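
A minimal sketch of checking and applying this setting, assuming the
kvm_intel module exposes the parameter at the usual sysfs path
/sys/module/kvm_intel/parameters/pml. The helper name pml_state is
made up for illustration, and the reload commands are shown only as
comments because they need root and no running VMs.

```shell
#!/bin/sh
# Report whether kvm_intel's PML (Page Modification Logging) is on.
# The optional argument lets the caller point at a different file;
# the default is the assumed sysfs location of the module parameter.
pml_state() {
    param_file="${1:-/sys/module/kvm_intel/parameters/pml}"
    if [ ! -r "$param_file" ]; then
        # Module not loaded, or the parameter is not exposed here.
        echo "unknown"
        return 0
    fi
    case "$(cat "$param_file")" in
        Y|1) echo "enabled" ;;
        N|0) echo "disabled" ;;
        *)   echo "unknown" ;;
    esac
}

pml_state

# To turn PML off (presumably on the L0 host, since that is where the
# two CPUs differ), reload the module as root with no VMs running:
#   modprobe -r kvm_intel
#   modprobe kvm_intel pml=0
```

The kvm_intel.pml=0 spelling above is the kernel command-line form of
the same module parameter.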



* Re: Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-23 18:32 UTC
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, QEMU Devel Mailing List

On Mon, Sep 23, 2019 at 3:42 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Jintack Lim (incredible.tack@gmail.com) wrote:
> > Hi,
>
> Copying in Paolo, since he recently did work to fix nested migration -
> it was expected to be broken until pretty recently; but 4.1.0 qemu on
> 5.3 kernel is pretty new, so I think I'd expected it to work.
>

Thank you, Dave. What Paolo proposed makes migration work!

> > I'm seeing VM live migration failure when a VM is running a nested VM.
> > I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
> > v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
> > v4.18, but I don't think that matters.
> >
> > The symptom is that L2 VM kernel crashes in different places after
> > migration but the call stack is mostly related to memory management
> > like [1] and [2]. The kernel crash happens almost all the time. While
> > L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
> > and L2 VM were doing nothing during migration.
> >
> > I found a few clues about this issue.
> > 1) It happens with a relatively large memory for L1 (24G), but it does
> > not with a smaller size (3G).
> >
> > 2) Dead migration worked; when I ran "stop" command in the qemu
> > monitor for L1 first and did migration, migration worked always. It
> > also worked when I only stopped L2 VM and kept L1 live during the
> > migration.
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
> >
> > This makes me confused because I thought migrating nested state
> > doesn't depend on the underlying hardware.. Anyways, L1-only migration
> > with the large memory size (24G) works on both CPUs without any
> > problem.
> >
> > I would appreciate any comments/suggestions to fix this problem.
>
> Can you share the qemu command lines you're using for both L1 and L2
> please ?

Sure. I use the same QEMU command line for L1 and L2 except for cpu
and memory allocation.

This is the one for running L1; for L2 I use fewer CPUs and less memory.
./qemu/x86_64-softmmu/qemu-system-x86_64 \
    -smp 6 -m 24G -M q35,accel=kvm -cpu host \
    -drive if=none,file=/vm_nfs/guest0.img,id=vda,cache=none,format=raw \
    -device virtio-blk-pci,drive=vda \
    --nographic \
    -qmp unix:/var/run/qmp,server,wait \
    -serial mon:stdio \
    -netdev user,id=net0,hostfwd=tcp::2222-:22 \
    -device virtio-net-pci,netdev=net0,mac=de:ad:be:ef:f2:12 \
    -netdev tap,id=net1,vhost=on,helper=/srv/vm/qemu/qemu-bridge-helper \
    -device virtio-net-pci,netdev=net1,disable-modern=off,disable-legacy=on,mac=de:ad:be:ef:f2:11 \
    -monitor telnet:127.0.0.1:4444,server,nowait

> Are there any dmesg entries around the time of the migration on either
> the hosts or the L1 VMs?

No, I didn't see anything special in the L0 or L1 kernel logs.

> What guest OS are you running in L1 and L2?
>

I'm using Linux v4.18 both in L1 and L2.

Thanks,
Jintack

> Dave
>
> > Thanks,
> > Jintack
> >
> >
> > [1]https://paste.ubuntu.com/p/XGDKH45yt4/
> > [2]https://paste.ubuntu.com/p/CpbVTXJCyc/
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-23 18:32 UTC
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert, QEMU Devel Mailing List

On Mon, Sep 23, 2019 at 4:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>
> Hmm, try disabling pml (kvm_intel.pml=0).  This would be the main
> difference, memory-management wise, between those two machines.
>

Thank you, Paolo.

With pml disabled, migration succeeded more than 20 times in a row on
the Intel(R) Xeon(R) Silver 4114 CPU, where it almost always failed
before.

I guess there's a problem in the KVM pml code? I'm fine with disabling
pml, but if you have patches to fix the issue, I'm willing to test
them on this CPU.

Thanks,
Jintack

> Paolo



* Re: Migration failure when running nested VMs
From: Paolo Bonzini @ 2019-09-24 0:19 UTC
  To: Jintack Lim; +Cc: Dr. David Alan Gilbert, QEMU Devel Mailing List

On 23/09/19 20:32, Jintack Lim wrote:
> On Mon, Sep 23, 2019 at 4:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
>>>
>>> With those two clues, I guess maybe some dirty pages made by L2 are
>>> not transferred to the destination correctly, but I'm not really sure.
>>>
>>> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
>>> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>>
>> Hmm, try disabling pml (kvm_intel.pml=0).  This would be the main
>> difference, memory-management wise, between those two machines.
>>
> 
> Thank you, Paolo.
> 
> This makes migration work successfully over 20 times in a row on
> Intel(R) Xeon(R) Silver 4114 CPU where migration failed almost always
> without disabling pml.
> 
> I guess there's a problem in KVM pml code? I'm fine with disabling
> pml. But if you have patches to fix the issue, I'm willing to test it
> on the CPU.

Yes, it's a known bug in the PML code (that I thought was not an issue
for migration, but I was wrong).  I'll try to get you a patch this week.

Paolo


