* 33 VCPUs in HVM guests with live migration with Linux hangs
@ 2014-04-04 20:44 Konrad Rzeszutek Wilk
  2014-04-07  8:32 ` Ian Campbell
  0 siblings, 1 reply; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-04 20:44 UTC (permalink / raw)
To: xen-devel; +Cc: boris.ostrovsky, david.vrabel

When live migrating I found out that if you try with more than 32 VCPUs
the guest gets stuck. It works OK when booting - all 33 VCPUs show up.
I use this small config:

kernel = "hvmloader"
device_model_version = 'qemu-xen-traditional'
vcpus = 33
builder='hvm'
memory=1024
serial='file:/var/log/xen/console-bootstrap-x86_64-pvhvm'
name="m"
disk = [ 'file:/mnt/lab/bootstrap-x86_64/root_image.iso,hdc:cdrom,r','phy:/dev/guests/bootstrap-x86_64-pvhvm,xvda,w']
boot="dn"
vif = [ 'mac=00:0F:4B:00:00:68, bridge=switch' ]
vnc=1
vnclisten="0.0.0.0"
usb=1
usbdevice="tablet"

And do a migration:

m                33  1023    33     -b----      14.3
-bash-4.1# xl migrate m localhost
root@localhost's password:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/418)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/418)
Savefile contains xl domain config
WARNING: ignoring "kernel" directive for HVM guest. Use "firmware_override" instead if you really want a non-default firmware
libxl: notice: libxl_numa.c:494:libxl__get_numa_candidate: NUMA placement failed, performance might be affected
xc: Reloading memory pages: 262144/1045504   25%
migration target: Transfer complete, requesting permission to start domain.
migration sender: Target has acknowledged transfer.
migration sender: Giving target permission to start.
migration target: Got permission, starting domain.
migration target: Domain started successsfully.
migration sender: Target reports successful startup.
Migration successful.

which completes successfully. 'xl vcpu-list' tells me that all 32 VCPUs
are blocked except one - the 33rd.
Its stack seems to alternate between:

Call Trace:
[<ffffffff81128a5a>] multi_cpu_stop+0x9a <--
ffff8800365b5da0: [<ffffffff811289c0>] multi_cpu_stop
ffff8800365b5dc0: [<ffffffff811290aa>] cpu_stopper_thread+0x4a
ffff8800365b5de0: [<ffffffff816f7081>] __schedule+0x381
ffff8800365b5e38: [<ffffffff810cbf90>] smpboot_thread_fn
ffff8800365b5e80: [<ffffffff810cc0d8>] smpboot_thread_fn+0x148
ffff8800365b5eb0: [<ffffffff810cbf90>] smpboot_thread_fn
ffff8800365b5ec0: [<ffffffff810c498e>] kthread+0xce
ffff8800365b5f28: [<ffffffff810c48c0>] kthread
ffff8800365b5f50: [<ffffffff81703e0c>] ret_from_fork+0x7c
ffff8800365b5f80: [<ffffffff810c48c0>] kthread

and

[<ffffffff8108f3c8>] pvclock_clocksource_read+0x18 <--
ffff880038603ef0: [<ffffffff81045698>] xen_clocksource_read+0x28
ffff880038603f00: [<ffffffff81057909>] sched_clock+0x9
ffff880038603f10: [<ffffffff810d7b85>] sched_clock_local+0x25
ffff880038603f40: [<ffffffff810d7ca8>] sched_clock_cpu+0xb8
ffff880038603f60: [<ffffffff810d840e>] irqtime_account_irq+0x4e
ffff880038603f80: [<ffffffff810a5279>] irq_enter+0x39
ffff880038603f90: [<ffffffff813f8480>] xen_evtchn_do_upcall+0x20
ffff880038603fb0: [<ffffffff817058ed>] xen_hvm_callback_vector+0x6d

which implies that the CPU is receiving interrupts, but is somehow stuck
in a thread doing something - probably waiting on a mutex. When the CPU
(33) started (before migration) this was its stack:

Call Trace:
[<ffffffff8108e846>] native_safe_halt+0x6 <--
ffff8800377d1e90: [<ffffffff8105989a>] default_idle+0x1a
ffff8800377d1eb0: [<ffffffff810591f6>] arch_cpu_idle+0x26
ffff8800377d1ec0: [<ffffffff810f6d76>] cpu_startup_entry+0xa6
ffff8800377d1ef0: [<ffffffff81109a55>] clockevents_register_device+0x105
ffff8800377d1f30: [<ffffffff8108236e>] start_secondary+0x19e

The only culprit I could think of was commit
d5b17dbff83d63fb6bf35daec21c8ebfb8d695b5 "xen/smp/pvhvm: Don't point
per_cpu(xen_vpcu, 33 and larger) to shared_info", which I had reverted -
but that did not help.
So the questions are:

1) Has anybody else actually booted HVM guests with more than 32 VCPUs
   and tried to migrate?
2) If yes, have you seen this before?

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: 33 VCPUs in HVM guests with live migration with Linux hangs
  2014-04-07  8:32 ` Ian Campbell
From: Ian Campbell @ 2014-04-07 8:32 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel, boris.ostrovsky, david.vrabel

On Fri, 2014-04-04 at 16:44 -0400, Konrad Rzeszutek Wilk wrote:
> The only culprit I could think of was commit d5b17dbff83d63fb6bf35daec21c8ebfb8d695b5
> "xen/smp/pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info"
>
> which I had reverted - but that did not help.

Looking at the commit log, that isn't surprising. That message hints
that there is some later setup which calls VCPUOP_register_vcpu_info
and causes cpus >= 33 to work (since, as the message says, shared_info
only has space for 32 cpus). I think you want to be investigating the
(lack of) calls to VCPUOP_register_vcpu_info.

> So the questions are:
>
> 1) Has anybody else actually booted HVM guests with more than 32 VCPUs
>    and tried to migrate?

Not me.

Ian.
* [PATCH] Fixes for more than 32 VCPUs migration for HVM guests (v1).
  2014-04-08 17:25 ` konrad
From: konrad @ 2014-04-08 17:25 UTC (permalink / raw)
To: xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

These two patches (one for Linux, one for Xen) allow PVHVM guests to use
the per-cpu VCPU mechanism after migration. Currently when a PVHVM guest
migrates, all the per-cpu information is lost and we fall back on the
shared_info structure - regardless of whether the HVM guest has 2 or 128
CPUs. Since that structure has an array for only 32 CPUs, a PVHVM guest
can only be migrated if it has at most 32 CPUs. These patches fix that
and allow more than 32 VCPUs to be migrated with PVHVM Linux guests.

The Linux diff is:

 arch/x86/xen/enlighten.c | 21 ++++++++++++++++-----
 arch/x86/xen/suspend.c   |  6 +-----
 arch/x86/xen/time.c      |  3 +++
 3 files changed, 20 insertions(+), 10 deletions(-)

while the Xen one is:

 xen/arch/x86/hvm/hvm.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
* [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-08 17:25 ` konrad
From: konrad @ 2014-04-08 17:25 UTC (permalink / raw)
To: xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

When we migrate an HVM guest, by default our shared_info can
only hold up to 32 CPUs. As such the hypercall
VCPUOP_register_vcpu_info was introduced which allowed us to
setup per-page areas for VCPUs. This means we can boot a PVHVM
guest with more than 32 VCPUs. During migration the per-cpu
structure is allocated fresh by the hypervisor (vcpu_info_mfn
is set to INVALID_MFN) so that the newly migrated guest
can make the VCPUOP_register_vcpu_info hypercall.

Unfortunately we end up triggering this condition:
    /* Run this command on yourself or on other offline VCPUS. */
    if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )

which means we are unable to setup the per-cpu VCPU structures
for running vCPUS. The Linux PV code paths make this work by
iterating over every vCPU with:

1) is the target CPU up? (VCPUOP_is_up hypercall)
2) if yes, then VCPUOP_down to pause it.
3) VCPUOP_register_vcpu_info
4) if we brought it down, then VCPUOP_up to bring it back up

But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
not allowed on HVM guests we can't do this. This patch
enables them.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/hvm/hvm.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 38c491e..b5b92fe 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
     case VCPUOP_stop_singleshot_timer:
     case VCPUOP_register_vcpu_info:
     case VCPUOP_register_vcpu_time_memory_area:
+    case VCPUOP_down:
+    case VCPUOP_up:
+    case VCPUOP_is_up:
        rc = do_vcpu_op(cmd, vcpuid, arg);
        break;
    default:
--
1.7.7.6
* Re: [Xen-devel] [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-08 18:18 ` Roger Pau Monné
From: Roger Pau Monné @ 2014-04-08 18:18 UTC (permalink / raw)
To: konrad, xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

On 08/04/14 19:25, konrad@kernel.org wrote:
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
> When we migrate an HVM guest, by default our shared_info can
> only hold up to 32 CPUs. As such the hypercall
> VCPUOP_register_vcpu_info was introduced which allowed us to
> setup per-page areas for VCPUs. This means we can boot a PVHVM
> guest with more than 32 VCPUs. During migration the per-cpu
> structure is allocated fresh by the hypervisor (vcpu_info_mfn
> is set to INVALID_MFN) so that the newly migrated guest
> can make the VCPUOP_register_vcpu_info hypercall.
>
> Unfortunately we end up triggering this condition:
>     /* Run this command on yourself or on other offline VCPUS. */
>     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>
> which means we are unable to setup the per-cpu VCPU structures
> for running vCPUS. The Linux PV code paths make this work by
> iterating over every vCPU with:
>
> 1) is the target CPU up? (VCPUOP_is_up hypercall)
> 2) if yes, then VCPUOP_down to pause it.
> 3) VCPUOP_register_vcpu_info
> 4) if we brought it down, then VCPUOP_up to bring it back up
>
> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> not allowed on HVM guests we can't do this. This patch
> enables them.

Hmmm, this looks like a very convoluted approach to something that
could be solved more easily IMHO. What we do on FreeBSD is put all
vCPUs into suspension, which means that all vCPUs except vCPU#0 will
be in the cpususpend_handler, see:

http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460

Then on resume we unblock the "suspended" CPUs, and the first thing
they do is call cpu_ops.cpu_resume, which is basically going to set up
the vcpu_info using VCPUOP_register_vcpu_info. Not sure if something
similar is possible under Linux, but it seems easier and doesn't
require any Xen-side changes.

Roger.
* Re: [Xen-devel] [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-08 18:53 ` Konrad Rzeszutek Wilk
From: Konrad Rzeszutek Wilk @ 2014-04-08 18:53 UTC (permalink / raw)
To: Roger Pau Monné
Cc: konrad, xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
> On 08/04/14 19:25, konrad@kernel.org wrote:
>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>
>> When we migrate an HVM guest, by default our shared_info can
>> only hold up to 32 CPUs. As such the hypercall
>> VCPUOP_register_vcpu_info was introduced which allowed us to
>> setup per-page areas for VCPUs. This means we can boot a PVHVM
>> guest with more than 32 VCPUs. During migration the per-cpu
>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>> is set to INVALID_MFN) so that the newly migrated guest
>> can make the VCPUOP_register_vcpu_info hypercall.
>>
>> Unfortunately we end up triggering this condition:
>>     /* Run this command on yourself or on other offline VCPUS. */
>>     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>
>> which means we are unable to setup the per-cpu VCPU structures
>> for running vCPUS. The Linux PV code paths make this work by
>> iterating over every vCPU with:
>>
>> 1) is the target CPU up? (VCPUOP_is_up hypercall)
>> 2) if yes, then VCPUOP_down to pause it.
>> 3) VCPUOP_register_vcpu_info
>> 4) if we brought it down, then VCPUOP_up to bring it back up
>>
>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>> not allowed on HVM guests we can't do this. This patch
>> enables them.
>
> Hmmm, this looks like a very convoluted approach to something that
> could be solved more easily IMHO. What we do on FreeBSD is put all
> vCPUs into suspension, which means that all vCPUs except vCPU#0 will
> be in the cpususpend_handler, see:
>
> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460

How do you 'suspend' them? If I remember right there is a disadvantage
to doing this, as you have to bring all the CPUs "offline". In Linux
that means using stop_machine, which is a pretty big hammer and
increases the latency for migration.

> Then on resume we unblock the "suspended" CPUs, and the first thing
> they do is call cpu_ops.cpu_resume, which is basically going to set up
> the vcpu_info using VCPUOP_register_vcpu_info. Not sure if something
> similar is possible under Linux, but it seems easier and doesn't
> require any Xen-side changes.
>
> Roger.
* Re: [Xen-devel] [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-09  7:37 ` Roger Pau Monné
From: Roger Pau Monné @ 2014-04-09 7:37 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: konrad, xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>> On 08/04/14 19:25, konrad@kernel.org wrote:
>>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>
>>> When we migrate an HVM guest, by default our shared_info can
>>> only hold up to 32 CPUs. As such the hypercall
>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>> setup per-page areas for VCPUs. This means we can boot a PVHVM
>>> guest with more than 32 VCPUs. During migration the per-cpu
>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>> is set to INVALID_MFN) so that the newly migrated guest
>>> can make the VCPUOP_register_vcpu_info hypercall.
>>>
>>> Unfortunately we end up triggering this condition:
>>>     /* Run this command on yourself or on other offline VCPUS. */
>>>     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>
>>> which means we are unable to setup the per-cpu VCPU structures
>>> for running vCPUS. The Linux PV code paths make this work by
>>> iterating over every vCPU with:
>>>
>>> 1) is the target CPU up? (VCPUOP_is_up hypercall)
>>> 2) if yes, then VCPUOP_down to pause it.
>>> 3) VCPUOP_register_vcpu_info
>>> 4) if we brought it down, then VCPUOP_up to bring it back up
>>>
>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>> not allowed on HVM guests we can't do this. This patch
>>> enables them.
>>
>> Hmmm, this looks like a very convoluted approach to something that
>> could be solved more easily IMHO. What we do on FreeBSD is put all
>> vCPUs into suspension, which means that all vCPUs except vCPU#0 will
>> be in the cpususpend_handler, see:
>>
>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>
> How do you 'suspend' them? If I remember right there is a disadvantage
> to doing this, as you have to bring all the CPUs "offline". In Linux
> that means using stop_machine, which is a pretty big hammer and
> increases the latency for migration.

In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:

http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289

Which makes all APs call cpususpend_handler, so we know all APs are
stuck in a while loop with interrupts disabled:

http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459

Then on resume the APs are taken out of the while loop, and the first
thing they do before returning from the IPI handler is register the new
per-cpu vcpu_info area. But I'm not sure this is something that can be
accomplished easily on Linux.

I've tried to local-migrate a FreeBSD PVHVM guest with 33 vCPUs on my
8-way box, and it seems to be working fine :).

Roger.
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-09 15:34 ` Konrad Rzeszutek Wilk
From: Konrad Rzeszutek Wilk @ 2014-04-09 15:34 UTC (permalink / raw)
To: Roger Pau Monné
Cc: keir, linux-kernel, konrad, david.vrabel, jbeulich, xen-devel, boris.ostrovsky

On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote:
> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>>> On 08/04/14 19:25, konrad@kernel.org wrote:
>>>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>>
>>>> When we migrate an HVM guest, by default our shared_info can
>>>> only hold up to 32 CPUs. As such the hypercall
>>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>>> setup per-page areas for VCPUs. This means we can boot a PVHVM
>>>> guest with more than 32 VCPUs. During migration the per-cpu
>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>>> is set to INVALID_MFN) so that the newly migrated guest
>>>> can make the VCPUOP_register_vcpu_info hypercall.
>>>>
>>>> Unfortunately we end up triggering this condition:
>>>>     /* Run this command on yourself or on other offline VCPUS. */
>>>>     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>>
>>>> which means we are unable to setup the per-cpu VCPU structures
>>>> for running vCPUS. The Linux PV code paths make this work by
>>>> iterating over every vCPU with:
>>>>
>>>> 1) is the target CPU up? (VCPUOP_is_up hypercall)
>>>> 2) if yes, then VCPUOP_down to pause it.
>>>> 3) VCPUOP_register_vcpu_info
>>>> 4) if we brought it down, then VCPUOP_up to bring it back up
>>>>
>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>>> not allowed on HVM guests we can't do this. This patch
>>>> enables them.
>>>
>>> Hmmm, this looks like a very convoluted approach to something that
>>> could be solved more easily IMHO. What we do on FreeBSD is put all
>>> vCPUs into suspension, which means that all vCPUs except vCPU#0 will
>>> be in the cpususpend_handler, see:
>>>
>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>>
>> How do you 'suspend' them? If I remember right there is a disadvantage
>> to doing this, as you have to bring all the CPUs "offline". In Linux
>> that means using stop_machine, which is a pretty big hammer and
>> increases the latency for migration.
>
> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
>
> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
>
> Which makes all APs call cpususpend_handler, so we know all APs are
> stuck in a while loop with interrupts disabled:
>
> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
>
> Then on resume the APs are taken out of the while loop, and the first
> thing they do before returning from the IPI handler is register the new
> per-cpu vcpu_info area. But I'm not sure this is something that can be
> accomplished easily on Linux.

That is a bit of what 'stop_machine' would do. It puts all of the CPUs
in whatever function you want. But I am not sure of the latency impact -
as in, what if the migration takes longer and all of the CPUs sit there
spinning? Another variant of that is 'smp_call_function'.

Then when we resume we would need a shared mailbox (easy enough, I
think) to tell us that the migration has been done, and then we would
need to call VCPUOP_register_vcpu_info. But if the migration has taken
quite long I fear that the watchdogs might kick in and start complaining
about stuck CPUs, especially if we are migrating an overcommitted guest.
With this patch the latency for the vCPUs to be 'paused', 'initted',
and 'unpaused' is, I think, much much smaller.

Ugh, let's postpone this exercise of using 'smp_call_function' until the
end of the summer and see. That functionality should be shared with the
PV code path IMHO.

> I've tried to local-migrate a FreeBSD PVHVM guest with 33 vCPUs on my
> 8-way box, and it seems to be working fine :).

Awesome!

> Roger.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [Xen-devel] [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating.
  2014-04-09 15:38 ` David Vrabel
From: David Vrabel @ 2014-04-09 15:38 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Roger Pau Monné, konrad, xen-devel, boris.ostrovsky, linux-kernel, keir, jbeulich

On 09/04/14 16:34, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote:
>> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>>>> On 08/04/14 19:25, konrad@kernel.org wrote:
>>>>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>>>
>>>>> When we migrate an HVM guest, by default our shared_info can
>>>>> only hold up to 32 CPUs. As such the hypercall
>>>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>>>> setup per-page areas for VCPUs. This means we can boot a PVHVM
>>>>> guest with more than 32 VCPUs. During migration the per-cpu
>>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>>>> is set to INVALID_MFN) so that the newly migrated guest
>>>>> can make the VCPUOP_register_vcpu_info hypercall.
>>>>>
>>>>> Unfortunately we end up triggering this condition:
>>>>>     /* Run this command on yourself or on other offline VCPUS. */
>>>>>     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>>>
>>>>> which means we are unable to setup the per-cpu VCPU structures
>>>>> for running vCPUS. The Linux PV code paths make this work by
>>>>> iterating over every vCPU with:
>>>>>
>>>>> 1) is the target CPU up? (VCPUOP_is_up hypercall)
>>>>> 2) if yes, then VCPUOP_down to pause it.
>>>>> 3) VCPUOP_register_vcpu_info
>>>>> 4) if we brought it down, then VCPUOP_up to bring it back up
>>>>>
>>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>>>> not allowed on HVM guests we can't do this. This patch
>>>>> enables them.
>>>>
>>>> Hmmm, this looks like a very convoluted approach to something that
>>>> could be solved more easily IMHO. What we do on FreeBSD is put all
>>>> vCPUs into suspension, which means that all vCPUs except vCPU#0 will
>>>> be in the cpususpend_handler, see:
>>>>
>>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>>>
>>> How do you 'suspend' them? If I remember right there is a disadvantage
>>> to doing this, as you have to bring all the CPUs "offline". In Linux
>>> that means using stop_machine, which is a pretty big hammer and
>>> increases the latency for migration.
>>
>> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
>>
>> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
>>
>> Which makes all APs call cpususpend_handler, so we know all APs are
>> stuck in a while loop with interrupts disabled:
>>
>> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
>>
>> Then on resume the APs are taken out of the while loop, and the first
>> thing they do before returning from the IPI handler is register the new
>> per-cpu vcpu_info area. But I'm not sure this is something that can be
>> accomplished easily on Linux.
>
> That is a bit of what 'stop_machine' would do. It puts all of the CPUs
> in whatever function you want. But I am not sure of the latency impact -
> as in, what if the migration takes longer and all of the CPUs sit there
> spinning? Another variant of that is 'smp_call_function'.

I tested stop_machine() on all CPUs during suspend once and it was
awful: 100s of ms of additional downtime. Perhaps a hand-rolled
IPI-and-park-in-handler would be quicker than the full stop_machine().

David
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-09 15:38 ` David Vrabel @ 2014-04-09 15:55 ` Konrad Rzeszutek Wilk 2014-04-09 15:55 ` [Xen-devel] " Konrad Rzeszutek Wilk 1 sibling, 0 replies; 36+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-09 15:55 UTC (permalink / raw) To: David Vrabel Cc: keir, linux-kernel, konrad, jbeulich, xen-devel, boris.ostrovsky, Roger Pau Monné On Wed, Apr 09, 2014 at 04:38:37PM +0100, David Vrabel wrote: > On 09/04/14 16:34, Konrad Rzeszutek Wilk wrote: > > On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote: > >> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote: > >>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote: > >>>> On 08/04/14 19:25, konrad@kernel.org wrote: > >>>>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > >>>>> > >>>>> When we migrate an HVM guest, by default our shared_info can > >>>>> only hold up to 32 CPUs. As such the hypercall > >>>>> VCPUOP_register_vcpu_info was introduced which allowed us to > >>>>> setup per-page areas for VCPUs. This means we can boot PVHVM > >>>>> guest with more than 32 VCPUs. During migration the per-cpu > >>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn > >>>>> is set to INVALID_MFN) so that the newly migrated guest > >>>>> can do make the VCPUOP_register_vcpu_info hypercall. > >>>>> > >>>>> Unfortunatly we end up triggering this condition: > >>>>> /* Run this command on yourself or on other offline VCPUS. */ > >>>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) > >>>>> > >>>>> which means we are unable to setup the per-cpu VCPU structures > >>>>> for running vCPUS. The Linux PV code paths make this work by > >>>>> iterating over every vCPU with: > >>>>> > >>>>> 1) is target CPU up (VCPUOP_is_up hypercall?) > >>>>> 2) if yes, then VCPUOP_down to pause it. 
> >>>>> 3) VCPUOP_register_vcpu_info > >>>>> 4) if it was down, then VCPUOP_up to bring it back up > >>>>> > >>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are > >>>>> not allowed on HVM guests we can't do this. This patch > >>>>> enables this. > >>>> > >>>> Hmmm, this looks like a very convoluted approach to something that could > >>>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into > >>>> suspension, which means that all vCPUs except vCPU#0 will be in the > >>>> cpususpend_handler, see: > >>>> > >>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 > >>> > >>> How do you 'suspend' them? If I remember there is a disadvantage of doing > >>> this as you have to bring all the CPUs "offline". That in Linux means using > >>> the stop_machine which is pretty big hammer and increases the latency for migration. > >> > >> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0: > >> > >> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289 > >> > >> Which makes all APs call cpususpend_handler, so we know all APs are > >> stuck in a while loop with interrupts disabled: > >> > >> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459 > >> > >> Then on resume the APs are taken out of the while loop and the first > >> thing they do before returning from the IPI handler is registering the > >> new per-cpu vcpu_info area. But I'm not sure this is something that can > >> be accomplished easily on Linux. > > > > That is a bit of what the 'stop_machine' would do. It puts all of the > > CPUs in whatever function you want. But I am not sure of the latency impact - as > > in what if the migration takes longer and all of the CPUs sit there spinning. > > Another variant of that is the 'smp_call_function'. > > I tested stop_machine() on all CPUs during suspend once and it was > awful: 100s of ms of additional downtime. Yikes. 
> > Perhaps a hand-rolled IPI-and-park-in-handler would be quicker the full > stop_machine(). But that is clearly a bigger patch than this little bug-fix. Do you want to just take this patch as is and then later on I can work on prototyping the 'IPI-and-park-in-handler'? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-08 18:53 ` Konrad Rzeszutek Wilk 2014-04-09 7:37 ` Roger Pau Monné @ 2014-04-09 7:37 ` Roger Pau Monné 2014-04-09 8:33 ` Ian Campbell 2014-04-09 8:33 ` [Xen-devel] " Ian Campbell 3 siblings, 0 replies; 36+ messages in thread From: Roger Pau Monné @ 2014-04-09 7:37 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: keir, linux-kernel, konrad, david.vrabel, jbeulich, xen-devel, boris.ostrovsky On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote: > On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote: >> On 08/04/14 19:25, konrad@kernel.org wrote: >>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> >>> When we migrate an HVM guest, by default our shared_info can >>> only hold up to 32 CPUs. As such the hypercall >>> VCPUOP_register_vcpu_info was introduced which allowed us to >>> setup per-page areas for VCPUs. This means we can boot PVHVM >>> guest with more than 32 VCPUs. During migration the per-cpu >>> structure is allocated fresh by the hypervisor (vcpu_info_mfn >>> is set to INVALID_MFN) so that the newly migrated guest >>> can do make the VCPUOP_register_vcpu_info hypercall. >>> >>> Unfortunatly we end up triggering this condition: >>> /* Run this command on yourself or on other offline VCPUS. */ >>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) >>> >>> which means we are unable to setup the per-cpu VCPU structures >>> for running vCPUS. The Linux PV code paths make this work by >>> iterating over every vCPU with: >>> >>> 1) is target CPU up (VCPUOP_is_up hypercall?) >>> 2) if yes, then VCPUOP_down to pause it. >>> 3) VCPUOP_register_vcpu_info >>> 4) if it was down, then VCPUOP_up to bring it back up >>> >>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are >>> not allowed on HVM guests we can't do this. This patch >>> enables this. >> >> Hmmm, this looks like a very convoluted approach to something that could >> be solved more easily IMHO. 
What we do on FreeBSD is put all vCPUs into >> suspension, which means that all vCPUs except vCPU#0 will be in the >> cpususpend_handler, see: >> >> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 > > How do you 'suspend' them? If I remember there is a disadvantage of doing > this as you have to bring all the CPUs "offline". That in Linux means using > the stop_machine which is pretty big hammer and increases the latency for migration. In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0: http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289 Which makes all APs call cpususpend_handler, so we know all APs are stuck in a while loop with interrupts disabled: http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459 Then on resume the APs are taken out of the while loop and the first thing they do before returning from the IPI handler is registering the new per-cpu vcpu_info area. But I'm not sure this is something that can be accomplished easily on Linux. I've tried to local-migrate a FreeBSD PVHVM guest with 33 vCPUs on my 8-way box, and it seems to be working fine :). Roger. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-08 18:53 ` Konrad Rzeszutek Wilk 2014-04-09 7:37 ` Roger Pau Monné 2014-04-09 7:37 ` Roger Pau Monné @ 2014-04-09 8:33 ` Ian Campbell 2014-04-09 8:33 ` [Xen-devel] " Ian Campbell 3 siblings, 0 replies; 36+ messages in thread From: Ian Campbell @ 2014-04-09 8:33 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: keir, linux-kernel, konrad, david.vrabel, jbeulich, xen-devel, boris.ostrovsky, Roger Pau Monné On Tue, 2014-04-08 at 14:53 -0400, Konrad Rzeszutek Wilk wrote: > On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote: > > On 08/04/14 19:25, konrad@kernel.org wrote: > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > > > When we migrate an HVM guest, by default our shared_info can > > > only hold up to 32 CPUs. As such the hypercall > > > VCPUOP_register_vcpu_info was introduced which allowed us to > > > setup per-page areas for VCPUs. This means we can boot PVHVM > > > guest with more than 32 VCPUs. During migration the per-cpu > > > structure is allocated fresh by the hypervisor (vcpu_info_mfn > > > is set to INVALID_MFN) so that the newly migrated guest > > > can do make the VCPUOP_register_vcpu_info hypercall. > > > > > > Unfortunatly we end up triggering this condition: > > > /* Run this command on yourself or on other offline VCPUS. */ > > > if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) > > > > > > which means we are unable to setup the per-cpu VCPU structures > > > for running vCPUS. The Linux PV code paths make this work by > > > iterating over every vCPU with: > > > > > > 1) is target CPU up (VCPUOP_is_up hypercall?) > > > 2) if yes, then VCPUOP_down to pause it. > > > 3) VCPUOP_register_vcpu_info > > > 4) if it was down, then VCPUOP_up to bring it back up > > > > > > But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are > > > not allowed on HVM guests we can't do this. This patch > > > enables this. 
> > > > Hmmm, this looks like a very convoluted approach to something that could > > be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into > > suspension, which means that all vCPUs except vCPU#0 will be in the > > cpususpend_handler, see: > > > > http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 > > How do you 'suspend' them? If I remember there is a disadvantage of doing > this as you have to bring all the CPUs "offline". That in Linux means using > the stop_machine which is pretty big hammer and increases the latency for migration. Yes, this is why the ability to have the toolstack save/restore the secondary vcpu state was added. It's especially important for checkpointing, but it's relevant to regular migrate as a performance improvement too. It's not just stop-machine, IIRC it's a tonne of udev events relating to cpus off/onlinign etc too and all the userspace activity which that implies. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-09 8:33 ` [Xen-devel] " Ian Campbell @ 2014-04-09 9:04 ` Roger Pau Monné 2014-04-09 9:04 ` [Xen-devel] " Roger Pau Monné 1 sibling, 0 replies; 36+ messages in thread From: Roger Pau Monné @ 2014-04-09 9:04 UTC (permalink / raw) To: Ian Campbell, Konrad Rzeszutek Wilk Cc: keir, linux-kernel, konrad, david.vrabel, jbeulich, xen-devel, boris.ostrovsky On 09/04/14 10:33, Ian Campbell wrote: > On Tue, 2014-04-08 at 14:53 -0400, Konrad Rzeszutek Wilk wrote: >> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote: >>> On 08/04/14 19:25, konrad@kernel.org wrote: >>>> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>>> >>>> When we migrate an HVM guest, by default our shared_info can >>>> only hold up to 32 CPUs. As such the hypercall >>>> VCPUOP_register_vcpu_info was introduced which allowed us to >>>> setup per-page areas for VCPUs. This means we can boot PVHVM >>>> guest with more than 32 VCPUs. During migration the per-cpu >>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn >>>> is set to INVALID_MFN) so that the newly migrated guest >>>> can do make the VCPUOP_register_vcpu_info hypercall. >>>> >>>> Unfortunatly we end up triggering this condition: >>>> /* Run this command on yourself or on other offline VCPUS. */ >>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) >>>> >>>> which means we are unable to setup the per-cpu VCPU structures >>>> for running vCPUS. The Linux PV code paths make this work by >>>> iterating over every vCPU with: >>>> >>>> 1) is target CPU up (VCPUOP_is_up hypercall?) >>>> 2) if yes, then VCPUOP_down to pause it. >>>> 3) VCPUOP_register_vcpu_info >>>> 4) if it was down, then VCPUOP_up to bring it back up >>>> >>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are >>>> not allowed on HVM guests we can't do this. This patch >>>> enables this. 
>>> >>> Hmmm, this looks like a very convoluted approach to something that could >>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into >>> suspension, which means that all vCPUs except vCPU#0 will be in the >>> cpususpend_handler, see: >>> >>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 >> >> How do you 'suspend' them? If I remember there is a disadvantage of doing >> this as you have to bring all the CPUs "offline". That in Linux means using >> the stop_machine which is pretty big hammer and increases the latency for migration. > > Yes, this is why the ability to have the toolstack save/restore the > secondary vcpu state was added. It's especially important for > checkpointing, but it's relevant to regular migrate as a performance > improvement too. > > It's not just stop-machine, IIRC it's a tonne of udev events relating to > cpus off/onlinign etc too and all the userspace activity which that > implies. Well, what's done on FreeBSD is nothing like that: it's called the cpususpend handler, but it's not off-lining CPUs or anything like that; it just places the CPU in a while loop inside an IPI handler, so we can do something like this with all APs: while (suspended) pause(); register_vcpu_info(); So the registration of the vcpu_info area happens just after the CPU is woken from suspension and before it leaves the IPI handler, and it's the CPU itself that calls VCPUOP_register_vcpu_info (so we can avoid the gate in Xen that prevents registering the vcpu_info area for vCPUs other than ourselves). Roger. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-08 18:18 ` [Xen-devel] " Roger Pau Monné 2014-04-08 18:53 ` Konrad Rzeszutek Wilk @ 2014-04-08 18:53 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 36+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-08 18:53 UTC (permalink / raw) To: Roger Pau Monné Cc: keir, linux-kernel, konrad, david.vrabel, jbeulich, xen-devel, boris.ostrovsky On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote: > On 08/04/14 19:25, konrad@kernel.org wrote: > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > When we migrate an HVM guest, by default our shared_info can > > only hold up to 32 CPUs. As such the hypercall > > VCPUOP_register_vcpu_info was introduced which allowed us to > > setup per-page areas for VCPUs. This means we can boot PVHVM > > guest with more than 32 VCPUs. During migration the per-cpu > > structure is allocated fresh by the hypervisor (vcpu_info_mfn > > is set to INVALID_MFN) so that the newly migrated guest > > can do make the VCPUOP_register_vcpu_info hypercall. > > > > Unfortunatly we end up triggering this condition: > > /* Run this command on yourself or on other offline VCPUS. */ > > if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) > > > > which means we are unable to setup the per-cpu VCPU structures > > for running vCPUS. The Linux PV code paths make this work by > > iterating over every vCPU with: > > > > 1) is target CPU up (VCPUOP_is_up hypercall?) > > 2) if yes, then VCPUOP_down to pause it. > > 3) VCPUOP_register_vcpu_info > > 4) if it was down, then VCPUOP_up to bring it back up > > > > But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are > > not allowed on HVM guests we can't do this. This patch > > enables this. > > Hmmm, this looks like a very convoluted approach to something that could > be solved more easily IMHO. 
What we do on FreeBSD is put all vCPUs into > suspension, which means that all vCPUs except vCPU#0 will be in the > cpususpend_handler, see: > > http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 How do you 'suspend' them? If I remember there is a disadvantage of doing this as you have to bring all the CPUs "offline". That in Linux means using the stop_machine which is pretty big hammer and increases the latency for migration. > > Then on resume we unblock the "suspended" CPUs, and the first thing they > do is call cpu_ops.cpu_resume which is basically going to setup the > vcpu_info using VCPUOP_register_vcpu_info. Not sure if something similar > is possible under Linux, but it seems easier and doesn't require any > Xen-side changes. > > Roger. > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-08 17:25 ` konrad 2014-04-08 18:18 ` [Xen-devel] " Roger Pau Monné @ 2014-04-08 18:18 ` Roger Pau Monné 2014-04-09 9:06 ` Jan Beulich 2014-04-09 9:06 ` Jan Beulich 3 siblings, 0 replies; 36+ messages in thread From: Roger Pau Monné @ 2014-04-08 18:18 UTC (permalink / raw) To: konrad, xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich On 08/04/14 19:25, konrad@kernel.org wrote: > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > When we migrate an HVM guest, by default our shared_info can > only hold up to 32 CPUs. As such the hypercall > VCPUOP_register_vcpu_info was introduced which allowed us to > setup per-page areas for VCPUs. This means we can boot PVHVM > guest with more than 32 VCPUs. During migration the per-cpu > structure is allocated fresh by the hypervisor (vcpu_info_mfn > is set to INVALID_MFN) so that the newly migrated guest > can do make the VCPUOP_register_vcpu_info hypercall. > > Unfortunatly we end up triggering this condition: > /* Run this command on yourself or on other offline VCPUS. */ > if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) > > which means we are unable to setup the per-cpu VCPU structures > for running vCPUS. The Linux PV code paths make this work by > iterating over every vCPU with: > > 1) is target CPU up (VCPUOP_is_up hypercall?) > 2) if yes, then VCPUOP_down to pause it. > 3) VCPUOP_register_vcpu_info > 4) if it was down, then VCPUOP_up to bring it back up > > But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are > not allowed on HVM guests we can't do this. This patch > enables this. Hmmm, this looks like a very convoluted approach to something that could be solved more easily IMHO. 
What we do on FreeBSD is put all vCPUs into suspension, which means that all vCPUs except vCPU#0 will be in the cpususpend_handler, see: http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460 Then on resume we unblock the "suspended" CPUs, and the first thing they do is call cpu_ops.cpu_resume which is basically going to setup the vcpu_info using VCPUOP_register_vcpu_info. Not sure if something similar is possible under Linux, but it seems easier and doesn't require any Xen-side changes. Roger. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-08 17:25 ` konrad 2014-04-08 18:18 ` [Xen-devel] " Roger Pau Monné 2014-04-08 18:18 ` Roger Pau Monné @ 2014-04-09 9:06 ` Jan Beulich 2014-04-09 9:06 ` Jan Beulich 3 siblings, 0 replies; 36+ messages in thread From: Jan Beulich @ 2014-04-09 9:06 UTC (permalink / raw) To: konrad; +Cc: keir, linux-kernel, david.vrabel, xen-devel, boris.ostrovsky >>> On 08.04.14 at 19:25, <konrad@kernel.org> wrote: > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op( > case VCPUOP_stop_singleshot_timer: > case VCPUOP_register_vcpu_info: > case VCPUOP_register_vcpu_time_memory_area: > + case VCPUOP_down: > + case VCPUOP_up: > + case VCPUOP_is_up: This, if I checked it properly, leaves only VCPUOP_initialise, VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM. None of which look inherently bad to be used by HVM (but VCPUOP_initialise certainly would need closer checking), so I wonder whether either the wrapper shouldn't be dropped altogether or at least be converted from a white list approach to a black list one. Jan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-09 9:06 ` Jan Beulich @ 2014-04-09 15:27 ` Konrad Rzeszutek Wilk 2014-04-09 15:36 ` Jan Beulich 2014-04-09 15:36 ` Jan Beulich 2014-04-09 15:27 ` Konrad Rzeszutek Wilk 1 sibling, 2 replies; 36+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-09 15:27 UTC (permalink / raw) To: Jan Beulich Cc: konrad, david.vrabel, xen-devel, boris.ostrovsky, linux-kernel, keir On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote: > >>> On 08.04.14 at 19:25, <konrad@kernel.org> wrote: > > --- a/xen/arch/x86/hvm/hvm.c > > +++ b/xen/arch/x86/hvm/hvm.c > > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op( > > case VCPUOP_stop_singleshot_timer: > > case VCPUOP_register_vcpu_info: > > case VCPUOP_register_vcpu_time_memory_area: > > + case VCPUOP_down: > > + case VCPUOP_up: > > + case VCPUOP_is_up: > > This, if I checked it properly, leaves only VCPUOP_initialise, > VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM. > None of which look inherently bad to be used by HVM (but > VCPUOP_initialise certainly would need closer checking), so I > wonder whether either the wrapper shouldn't be dropped altogether > or at least be converted from a white list approach to a black list one. I was being conservative here because I did not want to allow the other ones without at least testing them. Perhaps that can be done as a separate patch and this just as a bug-fix? > > Jan > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-09 15:27 ` Konrad Rzeszutek Wilk @ 2014-04-09 15:36 ` Jan Beulich 2014-04-22 18:34 ` Konrad Rzeszutek Wilk 2014-04-22 18:34 ` Konrad Rzeszutek Wilk 2014-04-09 15:36 ` Jan Beulich 1 sibling, 2 replies; 36+ messages in thread From: Jan Beulich @ 2014-04-09 15:36 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: david.vrabel, konrad, xen-devel, boris.ostrovsky, linux-kernel, keir >>> On 09.04.14 at 17:27, <konrad.wilk@oracle.com> wrote: > On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote: >> >>> On 08.04.14 at 19:25, <konrad@kernel.org> wrote: >> > --- a/xen/arch/x86/hvm/hvm.c >> > +++ b/xen/arch/x86/hvm/hvm.c >> > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op( >> > case VCPUOP_stop_singleshot_timer: >> > case VCPUOP_register_vcpu_info: >> > case VCPUOP_register_vcpu_time_memory_area: >> > + case VCPUOP_down: >> > + case VCPUOP_up: >> > + case VCPUOP_is_up: >> >> This, if I checked it properly, leaves only VCPUOP_initialise, >> VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM. >> None of which look inherently bad to be used by HVM (but >> VCPUOP_initialise certainly would need closer checking), so I >> wonder whether either the wrapper shouldn't be dropped altogether >> or at least be converted from a white list approach to a black list one. > > I was being conservative here because I did not want to allow the > other ones without at least testing it. > > Perhaps that can be done as a seperate patch and this just as > a bug-fix? I'm clearly not in favor of the patch as is - minimally I'd want it to convert the white list to a black list. And once you do this it would seem rather natural to not pointlessly add entries. Jan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-09 15:36 ` Jan Beulich @ 2014-04-22 18:34 ` Konrad Rzeszutek Wilk 2014-04-23 8:57 ` Jan Beulich 2014-04-23 8:57 ` Jan Beulich 2014-04-22 18:34 ` Konrad Rzeszutek Wilk 1 sibling, 2 replies; 36+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-22 18:34 UTC (permalink / raw) To: Jan Beulich Cc: david.vrabel, konrad, xen-devel, boris.ostrovsky, linux-kernel, keir On Wed, Apr 09, 2014 at 04:36:53PM +0100, Jan Beulich wrote: > >>> On 09.04.14 at 17:27, <konrad.wilk@oracle.com> wrote: > > On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote: > >> >>> On 08.04.14 at 19:25, <konrad@kernel.org> wrote: > >> > --- a/xen/arch/x86/hvm/hvm.c > >> > +++ b/xen/arch/x86/hvm/hvm.c > >> > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op( > >> > case VCPUOP_stop_singleshot_timer: > >> > case VCPUOP_register_vcpu_info: > >> > case VCPUOP_register_vcpu_time_memory_area: > >> > + case VCPUOP_down: > >> > + case VCPUOP_up: > >> > + case VCPUOP_is_up: > >> > >> This, if I checked it properly, leaves only VCPUOP_initialise, > >> VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM. > >> None of which look inherently bad to be used by HVM (but > >> VCPUOP_initialise certainly would need closer checking), so I > >> wonder whether either the wrapper shouldn't be dropped altogether > >> or at least be converted from a white list approach to a black list one. > > > > I was being conservative here because I did not want to allow the > > other ones without at least testing it. > > > > Perhaps that can be done as a seperate patch and this just as > > a bug-fix? > > I'm clearly not in favor of the patch as is - minimally I'd want it to > convert the white list to a black list. And once you do this it would > seem rather natural to not pointlessly add entries. 
With this patch: diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index b5b92fe..5eee790 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -3455,34 +3455,6 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) } } -static long hvm_vcpu_op( - int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg) -{ - long rc; - - switch ( cmd ) - { - case VCPUOP_register_runstate_memory_area: - case VCPUOP_get_runstate_info: - case VCPUOP_set_periodic_timer: - case VCPUOP_stop_periodic_timer: - case VCPUOP_set_singleshot_timer: - case VCPUOP_stop_singleshot_timer: - case VCPUOP_register_vcpu_info: - case VCPUOP_register_vcpu_time_memory_area: - case VCPUOP_down: - case VCPUOP_up: - case VCPUOP_is_up: - rc = do_vcpu_op(cmd, vcpuid, arg); - break; - default: - rc = -ENOSYS; - break; - } - - return rc; -} - typedef unsigned long hvm_hypercall_t( unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); @@ -3517,30 +3489,6 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) return compat_memory_op(cmd, arg); } -static long hvm_vcpu_op_compat32( - int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg) -{ - long rc; - - switch ( cmd ) - { - case VCPUOP_register_runstate_memory_area: - case VCPUOP_get_runstate_info: - case VCPUOP_set_periodic_timer: - case VCPUOP_stop_periodic_timer: - case VCPUOP_set_singleshot_timer: - case VCPUOP_stop_singleshot_timer: - case VCPUOP_register_vcpu_info: - case VCPUOP_register_vcpu_time_memory_area: - rc = compat_vcpu_op(cmd, vcpuid, arg); - break; - default: - rc = -ENOSYS; - break; - } - - return rc; -} static long hvm_physdev_op_compat32( int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) @@ -3563,7 +3511,7 @@ static long hvm_physdev_op_compat32( static hvm_hypercall_t *const hvm_hypercall64_table[NR_hypercalls] = { [ __HYPERVISOR_memory_op ] = (hvm_hypercall_t *)hvm_memory_op, [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op, 
- [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op, + HYPERCALL(vcpu_op), [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op, HYPERCALL(xen_version), HYPERCALL(console_io), @@ -3583,7 +3531,7 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = { [ __HYPERVISOR_memory_op ] = (hvm_hypercall_t *)hvm_memory_op_compat32, [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op_compat32, - [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op_compat32, + COMPAT_CALL(vcpu_op), [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op_compat32, COMPAT_CALL(xen_version), HYPERCALL(console_io), And with an HVM guest poking at the rest of the VCPUOPs: VCPUOP_get_physid, VCPUOP_initialise, and VCPUOP_send_nmi - either before the CPU is up or when it is up - all work. That is: VCPUOP_get_physid would return -EINVAL; VCPUOP_initialise would return either -EEXIST or 0, and in either case the content of the ctxt was full of zeros. The VCPUOP_send_nmi did cause the HVM guest to get an NMI and it spat out 'Dazed and confused'. It also noticed corruption: [ 3.611742] Corrupted low memory at c000fffc (fffc phys) = 00029b00 [ 2.386785] Corrupted low memory at ffff88000000fff8 (fff8 phys) = 2990000000000 Which is odd because there does not seem to be anything in the hypervisor's path that would cause this. > > Jan > ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating. 2014-04-22 18:34 ` Konrad Rzeszutek Wilk @ 2014-04-23 8:57 ` Jan Beulich 2014-04-23 8:57 ` Jan Beulich 1 sibling, 0 replies; 36+ messages in thread From: Jan Beulich @ 2014-04-23 8:57 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: keir, linux-kernel, konrad, david.vrabel, xen-devel, boris.ostrovsky >>> On 22.04.14 at 20:34, <konrad.wilk@oracle.com> wrote: > With this patch: > [...] > And with an HVM guest poking at the rest of VCPUOPs: VCPUOP_get_physid, > VCPUOP_initialise, and VCPUOP_send_nmi - either before the CPU is up or > when it is up - work. > > That is: the VCPUOP_get_physid would return -EINVAL; VCPUOP_initialise > would return either -EEXIST or 0, and in either case > the content of the ctxt was full of zeros. Good. > The VCPUOP_send_nmi did cause the HVM to get an NMI and it spitted out > 'Dazed and confused'. It also noticed corruption: > > [ 3.611742] Corrupted low memory at c000fffc (fffc phys) = 00029b00 > [ 2.386785] Corrupted low memory at ffff88000000fff8 (fff8 phys) = > 2990000000000 > > Which is odd because there does not seem to be anything in the path > of hypervisor that would cause this. Indeed. This looks a little like a segment descriptor got modified here with a descriptor table base of zero and a selector of 0xfff8. That corruption needs to be hunted down in any case before enabling VCPUOP_send_nmi for HVM. Jan ^ permalink raw reply [flat|nested] 36+ messages in thread
* [LINUX PATCH 2/2] xen/pvhvm: Support more than 32 VCPUs when migrating. 2014-04-08 17:25 ` [PATCH] Fixes for more than 32 VCPUs migration for HVM guests (v1) konrad 2014-04-08 17:25 ` [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating konrad 2014-04-08 17:25 ` konrad @ 2014-04-08 17:25 ` konrad 2014-04-08 17:25 ` konrad 3 siblings, 0 replies; 36+ messages in thread From: konrad @ 2014-04-08 17:25 UTC (permalink / raw) To: xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> When Xen migrates an HVM guest, by default its shared_info can only hold up to 32 CPUs. As such the hypercall VCPUOP_register_vcpu_info was introduced which allowed us to set up per-page areas for VCPUs. This means we can boot PVHVM guests with more than 32 VCPUs. During migration the per-cpu structure is allocated freshly by the hypervisor (vcpu_info_mfn is set to INVALID_MFN) so that the newly migrated guest can make a VCPUOP_register_vcpu_info hypercall. Unfortunately we end up triggering this condition in Xen: /* Run this command on yourself or on other offline VCPUS. */ if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) which means we are unable to set up the per-cpu VCPU structures for running vCPUS. The Linux PV code paths make this work by iterating over every vCPU with: 1) is the target CPU up (VCPUOP_is_up hypercall)? 2) if yes, then VCPUOP_down to pause it. 3) VCPUOP_register_vcpu_info 4) if it was up before step 2, then VCPUOP_up to bring it back up But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are not allowed on HVM guests we can't do this. However with the Xen commit "hvm: Support more than 32 VCPUS when migrating." we can do this. As such first check if VCPUOP_is_up is actually possible before trying this dance. As most of this dance code is done already in 'xen_setup_vcpu' let's make it callable on both PV and HVM. This means moving one of the checks out to 'xen_setup_runstate_info'.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- arch/x86/xen/enlighten.c | 21 ++++++++++++++++----- arch/x86/xen/suspend.c | 6 +----- arch/x86/xen/time.c | 3 +++ 3 files changed, 20 insertions(+), 10 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 201d09a..af8be96 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -236,12 +236,23 @@ static void xen_vcpu_setup(int cpu) void xen_vcpu_restore(void) { int cpu; + bool vcpuops = true; + const struct cpumask *mask; - for_each_possible_cpu(cpu) { + mask = xen_pv_domain() ? cpu_possible_mask : cpu_online_mask; + + /* Only Xen 4.5 and higher supports this. */ + if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == -ENOSYS) + vcpuops = false; + + for_each_cpu(cpu, mask) { bool other_cpu = (cpu != smp_processor_id()); - bool is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL); + bool is_up = false; + + if (vcpuops) + is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL); - if (other_cpu && is_up && + if (vcpuops && other_cpu && is_up && HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL)) BUG(); @@ -250,7 +261,7 @@ void xen_vcpu_restore(void) if (have_vcpu_info_placement) xen_vcpu_setup(cpu); - if (other_cpu && is_up && + if (vcpuops && other_cpu && is_up && HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL)) BUG(); } @@ -1751,7 +1762,7 @@ void __ref xen_hvm_init_shared_info(void) for_each_online_cpu(cpu) { /* Leave it to be NULL. 
*/ if (cpu >= MAX_VIRT_CPUS) - continue; + per_cpu(xen_vcpu, cpu) = NULL; /* Triggers xen_vcpu_setup.*/ per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu]; } } diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c index 45329c8..6fb3298 100644 --- a/arch/x86/xen/suspend.c +++ b/arch/x86/xen/suspend.c @@ -33,11 +33,7 @@ void xen_arch_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_callback_vector(); xen_unplug_emulated_devices(); - if (xen_feature(XENFEAT_hvm_safe_pvclock)) { - for_each_online_cpu(cpu) { - xen_setup_runstate_info(cpu); - } - } + xen_vcpu_restore(); #endif } diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index 7b78f88..d4feb2e 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -105,6 +105,9 @@ void xen_setup_runstate_info(int cpu) { struct vcpu_register_runstate_memory_area area; + if (xen_hvm_domain() && !(xen_feature(XENFEAT_hvm_safe_pvclock))) + return; + area.addr.v = &per_cpu(xen_runstate, cpu); if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, -- 1.7.7.6 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [LINUX PATCH 2/2] xen/pvhvm: Support more than 32 VCPUs when migrating.
  2014-04-08 17:25 ` konrad
@ 2014-04-09  8:03 ` Jan Beulich
From: Jan Beulich @ 2014-04-09 8:03 UTC (permalink / raw)
To: konrad; +Cc: keir, linux-kernel, david.vrabel, xen-devel, boris.ostrovsky

>>> On 08.04.14 at 19:25, <konrad@kernel.org> wrote:
> +	/* Only Xen 4.5 and higher supports this. */
> +	if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == -ENOSYS)
> +		vcpuops = false;

Did you mean to say "for HVM guests" in the comment? And of course the
comment could quickly become stale if we backported the Xen side change
to e.g. 4.4.1.

Jan
* [PATCH] Fixes for more than 32 VCPUs migration for HVM guests (v1).
From: konrad @ 2014-04-08 17:25 UTC (permalink / raw)
To: xen-devel, david.vrabel, boris.ostrovsky, linux-kernel, keir, jbeulich

These two patches (one for Linux, one for Xen) allow PVHVM guests to use
the per-cpu VCPU mechanism after migration. Currently, when a PVHVM guest
migrates, all of the per-cpu information is lost and we fall back on the
shared_info structure, regardless of whether the HVM guest has 2 or 128
CPUs. Since that structure has an array for only 32 CPUs, a PVHVM guest
can only be migrated with up to 32 CPUs. These patches fix that and allow
more than 32 VCPUs to be migrated with PVHVM Linux guests.

The Linux diff is:

 arch/x86/xen/enlighten.c |   21 ++++++++++++++++-----
 arch/x86/xen/suspend.c   |    6 +-----
 arch/x86/xen/time.c      |    3 +++
 3 files changed, 20 insertions(+), 10 deletions(-)

while the Xen one is:

 xen/arch/x86/hvm/hvm.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
end of thread, other threads:[~2014-04-23 8:57 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-04 20:44 33 VCPUs in HVM guests with live migration with Linux hangs Konrad Rzeszutek Wilk
2014-04-07  8:32 ` Ian Campbell
2014-04-08 17:25   ` [PATCH] Fixes for more than 32 VCPUs migration for HVM guests (v1) konrad
2014-04-08 17:25     ` [XEN PATCH 1/2] hvm: Support more than 32 VCPUS when migrating konrad
2014-04-08 17:25       ` konrad
2014-04-08 18:18       ` [Xen-devel] " Roger Pau Monné
2014-04-08 18:53         ` Konrad Rzeszutek Wilk
2014-04-09  7:37           ` Roger Pau Monné
2014-04-09 15:34             ` Konrad Rzeszutek Wilk
2014-04-09 15:34             ` [Xen-devel] " Konrad Rzeszutek Wilk
2014-04-09 15:38               ` David Vrabel
2014-04-09 15:55                 ` Konrad Rzeszutek Wilk
2014-04-09 15:55                 ` [Xen-devel] " Konrad Rzeszutek Wilk
2014-04-09 15:38               ` David Vrabel
2014-04-09  7:37           ` Roger Pau Monné
2014-04-09  8:33           ` Ian Campbell
2014-04-09  8:33           ` [Xen-devel] " Ian Campbell
2014-04-09  9:04             ` Roger Pau Monné
2014-04-09  9:04             ` [Xen-devel] " Roger Pau Monné
2014-04-08 18:53         ` Konrad Rzeszutek Wilk
2014-04-08 18:18       ` Roger Pau Monné
2014-04-09  9:06       ` Jan Beulich
2014-04-09  9:06       ` Jan Beulich
2014-04-09 15:27         ` Konrad Rzeszutek Wilk
2014-04-09 15:36           ` Jan Beulich
2014-04-22 18:34             ` Konrad Rzeszutek Wilk
2014-04-23  8:57               ` Jan Beulich
2014-04-23  8:57               ` Jan Beulich
2014-04-22 18:34             ` Konrad Rzeszutek Wilk
2014-04-09 15:36           ` Jan Beulich
2014-04-09 15:27         ` Konrad Rzeszutek Wilk
2014-04-08 17:25     ` [LINUX PATCH 2/2] xen/pvhvm: Support more than 32 VCPUs " konrad
2014-04-08 17:25       ` konrad
2014-04-09  8:03       ` Jan Beulich
2014-04-09  8:03       ` Jan Beulich
2014-04-08 17:25   ` [PATCH] Fixes for more than 32 VCPUs migration for HVM guests (v1) konrad