5.2.11+ Regression: > nproc/2 lockups during initramfs

* 5.2.11+ Regression: > nproc/2 lockups during initramfs
@ 2019-09-08 10:37 James Harvey
  2019-09-10 18:32 ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: James Harvey @ 2019-09-08 10:37 UTC (permalink / raw)
  To: kvm

Host is up to date Arch Linux, with exception of downgrading linux to
track this down to 5.2.11 - 5.2.13.  QEMU 4.1.0, but have also
downgraded to 4.0.0 to confirm no change.

Host is dual E5-2690 v1 Xeons.  With hyperthreading, 32 logical cores.
I've always been able to boot qemu with "-smp
cpus=30,cores=15,threads=1,sockets=2".  I leave 2 free for host
responsiveness.

Upgrading from 5.2.10 to 5.2.11 causes the VM to lock up while loading
the initramfs about 90-95% of the time.  (Probably a slight race
condition.)  On host, QEMU shows as nVmCPUs*100% CPU usage, so around
3000% for 30 cpus.

If I back down to "cpus=16,cores=8", it always boots.  If I increase
to "cpus=18,cores=9", it goes back to locking up 90-95% of the time.

Omitting "-accel=kvm" allows 5.2.11 to work on the host without issue,
so combined with that the only package needing to be downgraded is
linux to 5.2.10 to prevent the issue with KVM, I think this must be a
KVM issue.

Using version of QEMU with debug symbols gives:
* gdb backtrace: http://ix.io/1UyO
* 11 seconds of attaching strace to locked up qemu (167K): http://ix.io/1UyP
* strace from the beginning of starting a qemu that locks up (8MB):
https://filebin.ca/4uI15ztGAarw/strace.qemu.from.start
** This definitely changed timings, and it became harder to replicate,
to where I'd guess 20-30% of boots hang
** Interestingly, the strace only collected data for 5 seconds, even
though qemu continued at full CPU usage much longer.  Don't know what
to make of that, especially because the first strace was attached to
an already locked up qemu that had gone well past 5 seconds.

Like how the strace changed timings, I have seen attaching GDB to a
running qemu which pauses it, then simply running continue, has gotten
it "unstuck" immediately.

I've let this go 14 hours, but once it goes into complete CPU usage,
it never comes out.

If booting from the September 2019 Arch ISO, it hangs right after the
ISO's UEFI bootloader selects Arch Linux, then the screen goes black.

If booting from grub/systemd, it hangs right after "Loading Initial Ramdisk..."

^ permalink raw reply	[flat|nested] 5+ messages in thread