Live migration locks up 3.2 guests in do_timer(ticks ~ 500000)

* Live migration locks up 3.2 guests in do_timer(ticks ~ 500000)
@ 2014-09-08  5:54 Matt Mullins
  2014-09-08  8:18 ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Matt Mullins @ 2014-09-08  5:54 UTC (permalink / raw)
  To: kvm

Somewhere between kernel 3.2 and 3.11 on my VM hosts (yes, I know that narrows
it down a /whole lot/ ...), live migration started killing my Ubuntu precise
(kernel 3.2.x) guests, causing all of their vcpus to go into a busy loop.  Once
(and only once) I've observed the guest eventually becoming responsive again,
with a clock nearly 600 years in the future and a negative uptime.

I haven't been able to dig up any previous threads about this problem, so my
gut instinct is that I've configured something wonky.  Any pointers toward
/what/ I may have done wrong are appreciated.

It only seems to happen if I've given the guests Nehalem-class CPU features.
My longest-running VMs, from before I started passing-through the CPU
capabilities into the guest, seem to migrate without issue.

It also seems to happen reliably when the guest has been running for a while;
it's easily reproducible with guests that have been up ~1 day, and I've
reproduced it in VMs with an uptime of ~20 hours.  I haven't yet figured out a
lower-bound, which makes the testing cycle a little longer for me.

The guests that I reliably reproduce this on are Ubuntu 12.04 guests running
the current 3.2 kernel that Canonical distributes.  Recent Fedora kernels
(3.14+, IIRC) don't seem to busy-spin this way, though I haven't tested this
case exhaustively, and I haven't written down very good notes for the tests I
have done with Fedora.

The hosts are dual-socket Nehalem Xeons (L5520), currently running Ubuntu 14.04
and the associated 3.13 kernel.  I had previously reproduced this with 12.04
running a raring-backport 3.11 kernel as well, but I (seemingly erroneously)
assumed it may have been a qemu userspace discrepancy.

I have been poring through a debugger attached to the guest via qemu's
gdbserver after it gets sent in a busy-spin, and the stack trace is:

(gdb) bt
#0  second_overflow (secs=<optimized out>) at /build/buildd/linux-3.2.0/kernel/time/ntp.c:407
#1  0xffffffff81095c75 in logarithmic_accumulation (offset=3831765322649889943, shift=9) at /build/buildd/linux-3.2.0/kernel/time/timekeeping.c:987
#2  0xffffffff81096042 in update_wall_time () at /build/buildd/linux-3.2.0/kernel/time/timekeeping.c:1056
#3  0xffffffff81096e8d in do_timer (ticks=549606) at /build/buildd/linux-3.2.0/kernel/time/timekeeping.c:1246
#4  0xffffffff8109d825 in tick_do_update_jiffies64 (now=...) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:77
#5  0xffffffff8109dda6 in tick_nohz_update_jiffies (now=...) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:145
#6  0xffffffff8109e378 in tick_check_nohz (cpu=0) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:713
#7  tick_check_idle (cpu=0) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:731
#8  0xffffffff8106ff91 in irq_enter () at /build/buildd/linux-3.2.0/kernel/softirq.c:306
#9  0xffffffff8166cef3 in smp_apic_timer_interrupt (regs=<optimized out>) at /build/buildd/linux-3.2.0/arch/x86/kernel/apic/apic.c:880
#10 <signal handler called>
#11 0xffffffffffffff10 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 2)]
#0  read_seqbegin (sl=<optimized out>) at /build/buildd/linux-3.2.0/include/linux/seqlock.h:89
89      /build/buildd/linux-3.2.0/include/linux/seqlock.h: No such file or directory.
(gdb) bt
#0  read_seqbegin (sl=<optimized out>) at /build/buildd/linux-3.2.0/include/linux/seqlock.h:89
#1  ktime_get () at /build/buildd/linux-3.2.0/kernel/time/timekeeping.c:268
#2  0xffffffff8109e355 in tick_check_nohz (cpu=1) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:709
#3  tick_check_idle (cpu=1) at /build/buildd/linux-3.2.0/kernel/time/tick-sched.c:731
#4  0xffffffff8106ff91 in irq_enter () at /build/buildd/linux-3.2.0/kernel/softirq.c:306
#5  0xffffffff8166cef3 in smp_apic_timer_interrupt (regs=<optimized out>) at /build/buildd/linux-3.2.0/arch/x86/kernel/apic/apic.c:880
#6  <signal handler called>
#7  0xffffffffffffff10 in ?? ()

If I continue, then re-stop the guest, logarithmic_accumulation() is still in
the stacktrace, with the same offset and shift; the line numbers indicate it's
stuck in the following loop:
    while (timekeeper.xtime_nsec >= nsecps) {
        int leap;
        timekeeper.xtime_nsec -= nsecps;
        xtime.tv_sec++;
        leap = second_overflow(xtime.tv_sec);
        xtime.tv_sec += leap;
        wall_to_monotonic.tv_sec -= leap;
        if (leap)
            clock_was_set_delayed();
    }

Live migration is initiated through libvirt by virDomainMigrate with
flags=VIR_MIGRATE_LIVE, uri="tcp://$recv_hostname".

The guest is spawned by libvirtd with:
qemu-system-x86_64 -enable-kvm -name dog -S 
-machine pc-i440fx-trusty,accel=kvm,usb=off 
-cpu Nehalem,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 512 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 
-uuid 55fd4c19-2477-40a5-988f-aaccd60b20dc -no-user-config -nodefaults 
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/dog.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-boot menu=on,strict=on
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw
-device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1
-drive file=rbd:rbd/dog:id=libvirt:key=________________________________________:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
-netdev tap,ifname=vm9_0,script=no,id=hostnet0,vhost=on,vhostfd=26
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:62:7a:9d,bus=pci.0,addr=0x3
-vnc 0.0.0.0:9,password
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2
-incoming tcp:[::]:49152
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 

The libvirt domain XML is:
<domain type='kvm' id='12'>
  <name>dog</name>
  <uuid>55fd4c19-2477-40a5-988f-aaccd60b20dc</uuid>
  <memory unit='KiB'>524288</memory>
  <currentMemory unit='KiB'>524288</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-trusty'>hvm</type>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>Nehalem</model>
    <feature policy='require' name='dca'/>
    <feature policy='require' name='xtpr'/>
    <feature policy='require' name='tm2'/>
    <feature policy='require' name='est'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='ds_cpl'/>
    <feature policy='require' name='monitor'/>
    <feature policy='require' name='pbe'/>
    <feature policy='require' name='tm'/>
    <feature policy='require' name='ht'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='acpi'/>
    <feature policy='require' name='ds'/>
    <feature policy='require' name='vme'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <boot order='1'/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <disk type='network' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='e04aa789-0bd7-07ac-cf10-78d8f52a4162'/>
      </auth>
      <source protocol='rbd' name='rbd/dog'/>
      <target dev='vda' bus='virtio'/>
      <boot order='2'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='ethernet'>
      <mac address='00:16:3e:62:7a:9d'/>
      <script path='no'/>
      <target dev='vm9_0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='5909' autoport='no' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none'/>
</domain>

^ permalink raw reply	[flat|nested] 12+ messages in thread